<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8236580214478682937</id><updated>2012-02-16T18:17:55.636-08:00</updated><category term='hhp'/><title type='text'>Another Data Mining Blog</title><subtitle type='html'>A blog about playing with data and other maybe unrelated stuff</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>37</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7832553919273612292</id><published>2012-01-01T19:04:00.000-08:00</published><updated>2012-01-02T04:40:03.441-08:00</updated><title type='text'>Useful R Snippets</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Every time I need to do something in R it nearly always means I have to 
do a Google search or trawl my previous code to see how I did it before. &amp;nbsp;Here I am going to post some snippets of code - mainly for my own use so that I know where to find them. Much of this code will be 'borrowed' and probably not the most efficient (I like to write code the long way so I can follow what is going on) - but it seems to work. If anyone finds it doesn't work or there is a more efficient way then please let me know.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b style="font-family: Arial, Helvetica, sans-serif;"&gt;1. Randomly sampling data into a train and test set&lt;/b&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;totalrecords &amp;lt;- nrow(mydata)&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;trainfraction = 0.7&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;trainrecords = as.integer(totalrecords * trainfraction)&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;allrows &amp;lt;- 1:&lt;/span&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;totalrecords&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;trainrows &amp;lt;- sample(totalrecords,trainrecords)&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;testrows &amp;nbsp;&amp;lt;- allrows[-trainrows]&lt;/span&gt;&lt;br /&gt;&lt;span 
style="background-color: white; color: #38761d; font-family: Arial, Helvetica, sans-serif; font-size: x-small; white-space: pre;"&gt;#check&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="background-color: white; white-space: pre;"&gt;&lt;span style="color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;length(trainrows)  &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="background-color: white; white-space: pre;"&gt;&lt;span style="color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;length(testrows)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #38761d; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;#then build model, something like...&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;model &amp;lt;- lm(theFormula, data=mydata[trainrows,])&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b style="font-family: Arial, Helvetica, sans-serif;"&gt;1a. 
Randomly sampling data into a train and test set&lt;/b&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Thanks to Isamoor&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; color: #cc0000; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;"&gt;trainrows &amp;lt;- runif(nrow(mydata)) &amp;lt; 0.7&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #cc0000;"&gt;&lt;span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;"&gt;testrows &amp;lt;- !trainrows&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; line-height: 14px; text-align: -webkit-auto;"&gt;&lt;b&gt;2. 
Repeated n fold cross validation&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="text-align: -webkit-auto;"&gt;&lt;span style="font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif;"&gt;&lt;br /&gt;&lt;span style="line-height: 14px;"&gt;This is to generate a cross-validation set, useful when wanting to know the expected error or for generating a set to use for getting ensemble weightings.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;######################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# the error function&lt;/span&gt;&lt;br /&gt;calc_RMSE &amp;lt;- function(act,pred){&lt;br /&gt;    &lt;br /&gt;    aact &amp;lt;- as.matrix(act)&lt;br /&gt;    ppred &amp;lt;- as.matrix(pred)&lt;br /&gt;    &lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(nrow(aact) == nrow(ppred)){ &lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;return&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;sqrt&lt;/span&gt;(sum(((ppred) - (aact)) ^ 2) / nrow(aact)))&lt;br /&gt;    } &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; {&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;return&lt;/span&gt; (-99)&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#Load and prepare data&lt;/span&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;databuild &amp;lt;- iris&lt;br /&gt;datascore &amp;lt;- iris &lt;span style="color: #3f7f59;"&gt;#put real score set here&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#target - what we are predicting&lt;/span&gt;&lt;br /&gt;theTarget &amp;lt;- &lt;span style="color: #2a00ff;"&gt;'Sepal.Length'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#set the formula&lt;/span&gt;&lt;br /&gt;theFormula &amp;lt;- as.formula(paste(theTarget,&lt;span style="color: #2a00ff;"&gt;" ~ . "&lt;/span&gt;))&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#find the position of the target&lt;/span&gt;&lt;br /&gt;targindex &amp;lt;-  which(names(databuild)==theTarget)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#actuals&lt;/span&gt;&lt;br /&gt;build_actuals &amp;lt;- databuild[,targindex]&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#######################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#vectors to score the model outputs&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#######################################&lt;/span&gt;&lt;br /&gt;buildcases &amp;lt;- nrow(databuild)&lt;br /&gt;scorecases &amp;lt;- nrow(datascore)&lt;br /&gt;&lt;br /&gt;pred_train &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=buildcases)&lt;br /&gt;pred_test &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=buildcases)&lt;br /&gt;pred_score &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=scorecases)&lt;br /&gt;&lt;br /&gt;pred_trainLoop &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=buildcases)&lt;br /&gt;pred_testLoop &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=buildcases)&lt;br 
/&gt;pred_scoreLoop &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=scorecases)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#settings&lt;/span&gt;&lt;br /&gt;numloops &amp;lt;- 300&lt;br /&gt;numfolds &amp;lt;- 10&lt;br /&gt;&lt;br /&gt;test_errors &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=numloops)&lt;br /&gt;train_errors &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=numloops)&lt;br /&gt;&lt;br /&gt;pred_testLoop &amp;lt;- 0&lt;br /&gt;pred_trainLoop &amp;lt;- 0&lt;br /&gt;pred_scoreLoop &amp;lt;- 0&lt;br /&gt;    &lt;br /&gt;modtype = &lt;span style="color: #2a00ff;"&gt;'linear regression'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# now the work&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################################        &lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;for&lt;/span&gt;(loop in 1:numloops){&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;# generate the indicies for each fold    &lt;/span&gt;&lt;br /&gt;    id &amp;lt;- sample(rep(seq_len(numfolds), &lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;.out=buildcases))&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;# lapply over them:&lt;/span&gt;&lt;br /&gt;    indicies &amp;lt;- lapply(seq_len(numfolds), function(a) list(&lt;br /&gt;        test = which(id==a),&lt;br /&gt;        train = which(id!=a)&lt;br /&gt;    ))&lt;br /&gt;    &lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#reset the predictions for this loop&lt;/span&gt;&lt;br /&gt;    pred_train &amp;lt;- 0&lt;br /&gt;    pred_test &amp;lt;- 0&lt;br /&gt;    pred_score &amp;lt;- 0&lt;br /&gt;    &lt;br /&gt;        &lt;span style="color: #7f0055; 
font-weight: bold;"&gt;for&lt;/span&gt;(fold in 1:numfolds){&lt;br /&gt;            &lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#set the cases for this fold&lt;/span&gt;&lt;br /&gt;            rows_train &amp;lt;- indicies[[fold]]$train&lt;br /&gt;            rows_test  &amp;lt;- indicies[[fold]]$test&lt;br /&gt;&lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#build the models - use any model&lt;/span&gt;&lt;br /&gt;            model &amp;lt;- lm(theFormula, data=databuild[rows_train,])&lt;br /&gt;&lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#score up the model&lt;/span&gt;&lt;br /&gt;            buildPred &amp;lt;- predict(model, databuild, type=&lt;span style="color: #2a00ff;"&gt;"response"&lt;/span&gt;)&lt;br /&gt;            scorepred &amp;lt;- predict(model, datascore, type=&lt;span style="color: #2a00ff;"&gt;"response"&lt;/span&gt;)&lt;br /&gt;            &lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#now score the cv and scoring predictions&lt;/span&gt;&lt;br /&gt;            z &amp;lt;- buildPred&lt;br /&gt;            z[rows_test] &amp;lt;- 0&lt;br /&gt;            pred_train &amp;lt;- pred_train + z&lt;br /&gt;            pred_test[rows_test] &amp;lt;- buildPred[rows_test]&lt;br /&gt;            pred_score &amp;lt;- pred_score + scorepred&lt;br /&gt;        } &lt;span style="color: #3f7f59;"&gt;#next fold&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;    &lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#average the predictions on the train set&lt;/span&gt;&lt;br /&gt;    pred_train &amp;lt;- pred_train / (numfolds - 1)&lt;br /&gt;    pred_score &amp;lt;- pred_score / numfolds&lt;br /&gt;    &lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#add to previous loop results&lt;/span&gt;&lt;br /&gt;    pred_trainLoop &amp;lt;- pred_trainLoop + pred_train&lt;br /&gt;    pred_testLoop &amp;lt;- pred_testLoop + pred_test&lt;br /&gt;    pred_scoreLoop &amp;lt;- pred_scoreLoop + pred_score&lt;br /&gt;    &lt;br 
/&gt;    &lt;span style="color: #3f7f59;"&gt;#calculate the errors    &lt;/span&gt;&lt;br /&gt;    train_errors[loop] &amp;lt;- calc_RMSE(build_actuals,pred_trainLoop / loop)&lt;br /&gt;    test_errors[loop] &amp;lt;- calc_RMSE(build_actuals,pred_testLoop / loop)&lt;br /&gt;    &lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#report&lt;/span&gt;&lt;br /&gt;    cat(&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\n&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;loop = "&lt;/span&gt;,loop,&lt;span style="color: #2a00ff;"&gt;"train error = "&lt;/span&gt;,train_errors[loop],&lt;span style="color: #2a00ff;"&gt;"cv error = "&lt;/span&gt;,test_errors[loop]) &lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#plot a chart as we go&lt;/span&gt;&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(loop&amp;gt;1){&lt;br /&gt;         plot(test_errors[1:loop],col=&lt;span style="color: #2a00ff;"&gt;'blue'&lt;/span&gt;,type=&lt;span style="color: #2a00ff;"&gt;'l'&lt;/span&gt;,main = paste(modtype,numloops,&lt;span style="color: #2a00ff;"&gt;'by'&lt;/span&gt;,numfolds,&lt;span style="color: #2a00ff;"&gt;'-fold cross validation'&lt;/span&gt;), xlab = &lt;span style="color: #2a00ff;"&gt;'Repetitions'&lt;/span&gt;, ylab = &lt;span style="color: #2a00ff;"&gt;'RMSE'&lt;/span&gt;,ylim = range(rbind(test_errors[1:loop],train_errors[1:loop])))&lt;br /&gt;        abline(h=test_errors[loop],col=&lt;span style="color: #2a00ff;"&gt;'blue'&lt;/span&gt;)&lt;br /&gt;        points(train_errors[1:loop],type=&lt;span style="color: #2a00ff;"&gt;'l'&lt;/span&gt;,col=&lt;span style="color: #2a00ff;"&gt;'red'&lt;/span&gt;)&lt;br /&gt;        abline(h=train_errors[loop],col=&lt;span style="color: #2a00ff;"&gt;'red'&lt;/span&gt;)&lt;br /&gt;        legend(&lt;span style="color: #2a00ff;"&gt;'top'&lt;/span&gt;,c(&lt;span style="color: #2a00ff;"&gt;'test'&lt;/span&gt;,&lt;span style="color: 
#2a00ff;"&gt;'train'&lt;/span&gt;),col=c(&lt;span style="color: #2a00ff;"&gt;'blue'&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;'red'&lt;/span&gt;),lty=1)&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;} &lt;span style="color: #3f7f59;"&gt;#loop&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;########################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#the cross validation predictions and scoring set predictions&lt;/span&gt;&lt;br /&gt;     &lt;span style="color: #3f7f59;"&gt;#this is what we are after&lt;/span&gt;&lt;br /&gt;    cvPredictions &amp;lt;- pred_testLoop / numloops&lt;br /&gt;    scPredictions &amp;lt;- pred_scoreLoop / numloops&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#plot should show decreasing test error with increasing train error&lt;/span&gt;&lt;br /&gt;    plot(train_errors,test_errors,type=&lt;span style="color: #2a00ff;"&gt;'p'&lt;/span&gt;)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7832553919273612292?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7832553919273612292/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2012/01/useful-r-snippets.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7832553919273612292'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7832553919273612292'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2012/01/useful-r-snippets.html' 
title='Useful R Snippets'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8873115493842170414</id><published>2011-12-26T13:39:00.000-08:00</published><updated>2011-12-26T13:39:35.289-08:00</updated><title type='text'>$3 million - hmmm...</title><content type='html'>I've shown this plot a couple of times already, but here it is again, with another that shows the bar we have to limbo under to win $3 million.&lt;br /&gt;&lt;br /&gt;Anyone think this is possible?&lt;br /&gt;&lt;br /&gt;Any predictions of what the final winning error will be?&lt;br /&gt;&lt;br /&gt;I predict 0.453&lt;br /&gt;&lt;br /&gt;Place your bets here...&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-rGQc0lVyPvI/Tvjmahadu_I/AAAAAAAAAKc/7tH8hl2VtrY/s1600/leaderboardthroughtimeDEC2011.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="320" width="320" src="http://2.bp.blogspot.com/-rGQc0lVyPvI/Tvjmahadu_I/AAAAAAAAAKc/7tH8hl2VtrY/s320/leaderboardthroughtimeDEC2011.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-e8q9o4TseCs/TvjmanFDoRI/AAAAAAAAAKk/n8lag5te2zw/s1600/leaderboardthroughtimeDEC2011%2B3million.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="320" width="320" src="http://4.bp.blogspot.com/-e8q9o4TseCs/TvjmanFDoRI/AAAAAAAAAKk/n8lag5te2zw/s320/leaderboardthroughtimeDEC2011%2B3million.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8873115493842170414?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8873115493842170414/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/3-million-hmmm.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8873115493842170414'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8873115493842170414'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/3-million-hmmm.html' title='$3 million - hmmm...'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-rGQc0lVyPvI/Tvjmahadu_I/AAAAAAAAAKc/7tH8hl2VtrY/s72-c/leaderboardthroughtimeDEC2011.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8192301788125019898</id><published>2011-12-17T13:54:00.000-08:00</published><updated>2011-12-17T14:12:40.958-08:00</updated><title type='text'>What's Going On Here</title><content type='html'>In many of the analytics problems I have been involved in, the problem you end up dealing with is not the one you were initially briefed to solve.&lt;br /&gt;&lt;br /&gt;These new problems are always discovered by visualising the data in some way and spotting curious patterns.&lt;br /&gt;&lt;br /&gt;Here are three examples...&lt;br 
/&gt;&lt;br /&gt;&lt;b&gt;1. Algorithmic Trading Challenge&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.kaggle.com/c/AlgorithmicTradingChallenge"&gt;Algorithmic Trading Challenge&lt;/a&gt; is based on data from the London Stock Exchange and is about things called 'Liquidity Shocks'. I know nothing about these but we had data, so the first thing I did was plot a few graphs to see if I could get a better understanding of things.&lt;br /&gt;&lt;br /&gt;The plot below shows the times these 'Liquidity Shocks' occur.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-VT2HrjxvaTk/TuzliKgtWCI/AAAAAAAAAJs/hRUw7Hf3Y2o/s1600/shock%2Bevent%2Btimings.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="202" width="320" src="http://4.bp.blogspot.com/-VT2HrjxvaTk/TuzliKgtWCI/AAAAAAAAAJs/hRUw7Hf3Y2o/s320/shock%2Bevent%2Btimings.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Now it is quite clear there is something going on at 1pm, 2:30pm, after 3:30pm and at 4pm.&lt;br /&gt;&lt;br /&gt;Interestingly these spikes are only evident when all commodities are looked at together; they are not as obvious in any individual commodity.&lt;br /&gt;&lt;br /&gt;If I was solving a business problem, my first step would be to return to the business to get more insight into what was going on here. My initial thoughts were lunch breaks and the opening times of other Stock Exchanges around the world - as 2:30pm London time is normally opening time in New York.&lt;br /&gt;&lt;br /&gt;Understanding the cause of these peaks is important as you would expect the reaction to them (the problem to solve) to be a function of the cause. &lt;br /&gt;&lt;br /&gt;If we did discover it was the opening times of other exchanges, then I would ask for extra information like the specific dates, so I could calculate when these peaks would occur in the future when the clocks changed. 
We do not have this information at the current time, or even the day of the week (it can be inferred but not accurately as there will be public holidays when the exchanges are closed).&lt;br /&gt;&lt;br /&gt;As it stands any models built could potentially fail on the leaderboard (or real life) data as our model might think 2:30pm is a special time, whereas really it is when another exchange opens, or when people come back from lunch. We need this causal information rather than just dealing with the effect - time differences change - lunch breaks may change.&lt;br /&gt;&lt;br /&gt;The current competition data is potentially lacking the full information required to build a model that is as robust as possible over time.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2. Interesting Distributions&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;One of the first things I do when receiving a data set is to scan the distributions of all variables to sanity check them for anything that looks out of place - but still things can sneak past you.&lt;br /&gt;&lt;br /&gt;The following is exam mark data in the range 0-100. If we bin it in 20 bins then things look reasonable, but if we zoom in then we get the 'what is going on here' question again. It is quite clear what is going on, but if exam marks are the thing we are trying to predict, how do we deal with this phenomenon and how would our algorithm cope looking at it blindly? And what if the pass mark or the rules changed - the algorithm would fail. 
Again, we need to be aware of the underlying root cause and not just the effect.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-Rc_bjhXSSQU/TuzwpywQJPI/AAAAAAAAAJ4/4bNHBSGxql8/s1600/A1.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="162" width="320" src="http://2.bp.blogspot.com/-Rc_bjhXSSQU/TuzwpywQJPI/AAAAAAAAAJ4/4bNHBSGxql8/s320/A1.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-vfFNvMrfV3Y/TuzwqKcQ8mI/AAAAAAAAAKA/6RRp38DQKmQ/s1600/A2.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="165" width="320" src="http://3.bp.blogspot.com/-vfFNvMrfV3Y/TuzwqKcQ8mI/AAAAAAAAAKA/6RRp38DQKmQ/s320/A2.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3. Don't Get Kicked&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is another &lt;a href="http://www.kaggle.com/c/DontGetKicked"&gt;Kaggle Competition...&lt;/a&gt; &lt;br /&gt;&lt;br /&gt;&lt;i&gt;Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. 
Kick cars can be very costly to dealers after transportation cost, throw-away repair work, and market losses in reselling the vehicle.&lt;br /&gt;&lt;br /&gt;Modelers who can figure out which cars have a higher risk of being kick can provide real value to dealerships trying to provide the best inventory selection possible to their customers.&lt;br /&gt;&lt;br /&gt;The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy)&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;This is a binary classification task and a quick way to spot data issues with this type of problem is to throw it in a decision tree in order to spot what are called 'gimmees'. These are cases that are easily, almost perfectly, predictable and are more often than not the result of supplying predictor data that just shouldn't be there because it is not known at prediction time (future information) - an extraction issue that would result in a useless model (it is common for people to think they have built really good predictive models using future information without really questioning why their models are so good!).&lt;br /&gt;&lt;br /&gt;Another reason 'gimmees' occur is a poorly defined target variable, that is, not excluding certain cases (an example in target marketing would be not excluding dead people from your mailing list and then predicting they won't respond to your offer!).&lt;br /&gt;&lt;br /&gt;After a bit of data prep I threw the Don't Get Kicked data into a Tiberius Decision Tree - the visual below immediately tells me there are clear cut cases of cars that will be kicked - it is almost black and white.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-Q0d99nyw0dY/Tuz7_oYV62I/AAAAAAAAAKQ/RHeHtKz2uSI/s1600/tree.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="230" width="320" 
src="http://2.bp.blogspot.com/-Q0d99nyw0dY/Tuz7_oYV62I/AAAAAAAAAKQ/RHeHtKz2uSI/s320/tree.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;These 'gimmees' can be described by the rules...&lt;br /&gt;&lt;br /&gt;[WheelTypeID] = 'NULL' AND [Auction] &lt;&gt; 'MANHEIM'&lt;br /&gt;&lt;br /&gt;MANHEIM is an auctioneering company where cars are auctioned - there are 2 main auctioneers in the data set plus 'other'.&lt;br /&gt;&lt;br /&gt;Having worked extensively with car auction data before I know that there are certain auctions where only 'write off' cars are sold, that is those that are sold for scrap because they have been in accidents. I also know that different auction houses will record data differently.&lt;br /&gt;&lt;br /&gt;The above simple rule easily identifies cars that are more than likely going to be 'knocked' - but this is probably because they are 'knocked' in the first place (are we just saying that someone in a coma is more likely to die?). Is this useful? Is this a poor definition of what is 'knocked'? Why does a missing value for WheelTypeID make such a big difference between auction houses?&lt;br /&gt;&lt;br /&gt;A bit more digging reveals that location and the specific buyer narrow down these gimmees even more...&lt;br /&gt;&lt;br /&gt;[WheelTypeID] = 'NULL' AND [Auction] &lt;&gt; 'MANHEIM' AND [VNST] in ('NC','AZ') AND [BYRNO] NOT IN (99750,99761)&lt;br /&gt;&lt;br /&gt;and after excluding these 'gimmees' it becomes clear there are certain buyers that just don't buy knocked cars, especially 99750 and 99761...&lt;br /&gt;&lt;br /&gt;byrno = 99750 and VNST in ('SC','NC','UT','ID','PA','WV','MO','WA')&lt;br /&gt;byrno = 99761 and Auction = 'MANHEIM'&lt;br /&gt;byrno = 99761 and MAKE = 'SUZUKI'&lt;br /&gt;byrno = 99761 and SIZE = 'VAN'&lt;br /&gt;byrno = 99761 and VNST IN ('FL','VA') &lt;br /&gt;&lt;br /&gt;Now is this actually useful?   
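&lt;br /&gt;&lt;br /&gt;Rules like the ones above are easy to check against the data. Here is a minimal sketch in R, assuming the data sits in a data frame with WheelTypeID and Auction columns and a 0/1 target that I have called IsBadBuy. The toy data frame, its values and the names cars, gimmee, kick_rate and n_cases are made up purely for illustration, so swap in the real training data and column names.&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;
# toy stand-in for the competition data - values are made up for illustration
cars &amp;lt;- data.frame(
  WheelTypeID = c("NULL", "NULL", "1", "2", "NULL", "1"),
  Auction     = c("ADESA", "OTHER", "MANHEIM", "ADESA", "MANHEIM", "OTHER"),
  IsBadBuy    = c(1, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# the first 'gimmee' rule from the text
gimmee &amp;lt;- cars$WheelTypeID == "NULL" &amp;amp; cars$Auction != "MANHEIM"

# kick rate and number of cases inside (TRUE) and outside (FALSE) the rule
kick_rate &amp;lt;- tapply(cars$IsBadBuy, gimmee, mean)
n_cases   &amp;lt;- table(gimmee)
print(kick_rate)
print(n_cases)
&lt;/pre&gt;If the kick rate for the TRUE group is close to 1 and the group is a decent size, it is probably a 'gimmee' worth questioning before any modelling is done.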
&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy)&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The model is going to focus on who bought the car rather than the characteristics of the car itself. What happens if buyers suddenly change their policy? Wouldn't we rather just go and speak to these buyers to understand what their policy is and hence get some business understanding? Why is specific auction house location so important? Is it because of the specific auction house itself or that specific cars are actually routed to specific places (this does happen).&lt;br /&gt;&lt;br /&gt;Basically if this was a real client engagement I would be going back to them with a lot of questions to help me understand the data better so it can be used in a way that is going to be useful to them. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;In Summary&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When doing predictive modelling, you can throw the latest hot algorithm at a problem such as a GBM, Neural Net or Random Forest and get impressive results, but unless you thoroughly understand and account for the real dynamics of what is going on then the models could disastrously fail when these dynamics change. 
I find visualisation the key to spotting and interpreting these dynamics - which is why I would rather have a good data miner who knows what he is doing using free software over a poor data miner with the most expensive software - see http://analystfirst.com/analyst-first-101/&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8192301788125019898?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8192301788125019898/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/whats-going-on-here.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8192301788125019898'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8192301788125019898'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/whats-going-on-here.html' title='What&apos;s Going On Here'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-VT2HrjxvaTk/TuzliKgtWCI/AAAAAAAAAJs/hRUw7Hf3Y2o/s72-c/shock%2Bevent%2Btimings.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7069955136294258823</id><published>2011-12-16T01:58:00.000-08:00</published><updated>2011-12-16T17:35:51.288-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='hhp'/><title type='text'>Two Become One</title><content type='html'>In the previous post I looked at the HHP leaderboard and discovered some interesting patterns regarding certain teams.&lt;br /&gt;&lt;br /&gt;It looks as if the evidence has proved to be true, with SD_John and Lily now all of a sudden merging into a single team.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-T3tpIELdp5k/TusQupwh3wI/AAAAAAAAAIw/aFJyb-LQtAs/s1600/SDJOHNLILY_MERGE.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="118" width="320" src="http://1.bp.blogspot.com/-T3tpIELdp5k/TusQupwh3wI/AAAAAAAAAIw/aFJyb-LQtAs/s320/SDJOHNLILY_MERGE.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Interestingly they have also been in other competitions with very similar results.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-K7j6q-spSss/TusReCMO4kI/AAAAAAAAAI8/wtOBG9mDGx8/s1600/JOHN_TEAM.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="222" width="320" src="http://4.bp.blogspot.com/-K7j6q-spSss/TusReCMO4kI/AAAAAAAAAI8/wtOBG9mDGx8/s320/JOHN_TEAM.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-l5d6Kqyi-UI/TusReYCU22I/AAAAAAAAAJI/oo0T30sJRjk/s1600/LILY_TEAM.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="225" width="320" src="http://3.bp.blogspot.com/-l5d6Kqyi-UI/TusReYCU22I/AAAAAAAAAJI/oo0T30sJRjk/s320/LILY_TEAM.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;This was the final standing in the Give Me Some Credit competition,&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://4.bp.blogspot.com/-mtuWwqz_uoA/TusRenh0_bI/AAAAAAAAAJU/ZEypXe9Yv7I/s1600/GIVEMESOMECREDIT.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="136" width="320" src="http://4.bp.blogspot.com/-mtuWwqz_uoA/TusRenh0_bI/AAAAAAAAAJU/ZEypXe9Yv7I/s320/GIVEMESOMECREDIT.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;What is actually more interesting here is the demonstration of overfitting to the leaderboard. Opera Solutions &amp; JYL are more than likely working together and we know Lily &amp; SD_John are working together. If you look at the leaderboard just before the competition ended (on the 30%) you will see Opera near the top but the final position on the 70% was much worse. Similarly a few others found that relying on the leaderboard as an indication of the final position can be misplaced trust.&lt;br /&gt;&lt;br /&gt;If &lt;a href="http://www.kaggle.com/c/GiveMeSomeCredit/forums/t/1153/magic-team-migration"&gt;you followed the competition forum&lt;/a&gt;, you will see team VSU also had multiple accounts for the same person, and they seem to have also fallen into the same trap of overfitting to the leaderboard - they ended up 9th on the 70% when they were first on the 30%.&lt;br /&gt;&lt;br /&gt;The data mining lesson here is that you need to take all necessary steps to avoid overfitting, rather than just relying on the leaderboard feedback.&lt;br /&gt;&lt;br /&gt;Congratulations to Nathaniel, Eu Jin (small world - I used to work with Nathaniel at the National Australia Bank and regularly see Eu Jin at the Melbourne R user group) and Alec, who clearly did not overfit. 
A Perfect Storm!&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-yTgBuR_9hjo/TusUhWe5Q1I/AAAAAAAAAJg/-EH898AI7Ns/s1600/credit%2Brisk%2Bleaderboard%2Bopera.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="320" width="314" src="http://4.bp.blogspot.com/-yTgBuR_9hjo/TusUhWe5Q1I/AAAAAAAAAJg/-EH898AI7Ns/s320/credit%2Brisk%2Bleaderboard%2Bopera.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7069955136294258823?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7069955136294258823/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/two-become-one.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7069955136294258823'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7069955136294258823'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/two-become-one.html' title='Two Become One'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-T3tpIELdp5k/TusQupwh3wI/AAAAAAAAAIw/aFJyb-LQtAs/s72-c/SDJOHNLILY_MERGE.PNG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7869239410549189343</id><published>2011-12-14T13:01:00.000-08:00</published><updated>2011-12-14T17:12:06.463-08:00</updated><title type='text'>Phantom of the Opera</title><content type='html'>There have been some &lt;a href="http://www.heritagehealthprize.com/c/hhp/forums/t/1155/rules-and-terms-conditions/7112#post7112"&gt;recent announcements on Kaggle&lt;/a&gt; reminding competitors about the rules regarding teams and that a single person can't have multiple accounts in order to get around the daily submission limit.&lt;br /&gt;&lt;br /&gt;I used the HHP leaderboard as an interesting data source to educate myself on the data manipulation capabilities in R and it became very evident that there was some curious behaviour going on.&lt;br /&gt;&lt;br /&gt;From a data scientist's viewpoint, this demonstrates the power of the human eye in picking up things that will give you the insight that an algorithm won't. In most (probably all) of my professional projects the important data issues and findings have been a result of looking at visualisations of the data and asking the question "what's going on here!".&lt;br /&gt;&lt;br /&gt;The first curiosity on the leaderboard came from trying to discover whether the competition was attracting new entrants, by looking at the date of each entrant's first submission. The two plots below show different ways of looking at the same data. 
What is obvious is that the 29th Nov had an unusual number of new entrants.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-UaJMP6G4Vto/Tuj9rc0Gj8I/AAAAAAAAAG4/ioTd6GW7tfE/s1600/first%2Bsubmission%2Bdates.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="246" width="320" src="http://2.bp.blogspot.com/-UaJMP6G4Vto/Tuj9rc0Gj8I/AAAAAAAAAG4/ioTd6GW7tfE/s320/first%2Bsubmission%2Bdates.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;and looked at in another way...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-K2rekL9YhrI/Tuj98FY_8iI/AAAAAAAAAHE/LsNj1qIXcYo/s1600/first%2Bsubmission%2Bdates%2Bcumulative.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="246" width="320" src="http://2.bp.blogspot.com/-K2rekL9YhrI/Tuj98FY_8iI/AAAAAAAAAHE/LsNj1qIXcYo/s320/first%2Bsubmission%2Bdates%2Bcumulative.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;What's going on here?&lt;br /&gt;&lt;br /&gt;If you look at the team name of the entrants it is clear that all these accounts are somewhat connected - so no real mystery as to the cause of the blip for this date.&lt;br /&gt;&lt;br /&gt;"accnt002"    "accnt003"    "accnt004"    "accnt005"    "accnt006"   &lt;br /&gt;"accnt007"    "accnt008"    "accnt009"    "cyclops"     "Faber"      &lt;br /&gt;"Farbe"       "Fortis"      "glad5"       "glad55"      "gladiator"  &lt;br /&gt;"gladiator1"  "gladiator2"  "gladiator3"  "jackie"      "Kaggleacctk"&lt;br /&gt;"KaggleK2"    "sashik"   &lt;br /&gt;&lt;br /&gt;The next two plots show the scores of the first submission of teams.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://1.bp.blogspot.com/-OYUY2cvdw9Q/TukBVeU8L1I/AAAAAAAAAHQ/vQfKtOnET9Y/s1600/first%2Bscores%2B1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="246" width="320" src="http://1.bp.blogspot.com/-OYUY2cvdw9Q/TukBVeU8L1I/AAAAAAAAAHQ/vQfKtOnET9Y/s320/first%2Bscores%2B1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-KQ5bJh7kr0M/TukBeB2NdtI/AAAAAAAAAHc/Nj6YBElxdZM/s1600/first%2Bscores%2B2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="246" width="320" src="http://1.bp.blogspot.com/-KQ5bJh7kr0M/TukBeB2NdtI/AAAAAAAAAHc/Nj6YBElxdZM/s320/first%2Bscores%2B2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;What's going on here?&lt;br /&gt;&lt;br /&gt;The common scores where the steps are seen are the all zeros benchmark, the optimised constant benchmark and the code we posted in our writeup - so this is explained. There is another common first score which is another very simple model that many teams independently thought of.&lt;br /&gt;&lt;br /&gt;What does raise an eyebrow from the cumulative plot is that one team stands out as having a very impressive first score. This is team YARONG who posted a very impressive model of 0.457698 on the first attempt and it still remains their best score 22 attempts later. This is possible (you don't need to submit models to blend them if you have your own holdout set - see the IBM writeup in the KDD Cup Orange Challenge) but somewhat unlikely as we know from the writeups that an individual model will get you nowhere near this score.&lt;br /&gt;&lt;br /&gt;If you take the dates on which teams submit and correlate them between pairs of teams, one team appears twice towards the top - SD_John, and they are also at the top of the leaderboard.  
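&lt;br /&gt;&lt;br /&gt;A minimal sketch of one way such a correlation could be computed in R is shown here. It uses made-up toy data and hypothetical names (subs, counts, correl, team_pairs), and is an assumption about the general approach, not the code behind the table that follows.&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;# hypothetical toy data: one row per submission
subs &amp;lt;- data.frame(
  team = c("A", "A", "A", "B", "B", "B", "C", "C"),
  day  = as.Date(c("2011-11-01", "2011-11-03", "2011-11-07",
                   "2011-11-01", "2011-11-03", "2011-11-07",
                   "2011-11-02", "2011-11-05"))
)

# rows are days, columns are teams, cells are submission counts
counts &amp;lt;- unclass(table(subs$day, subs$team))

# pairwise Pearson correlation between the teams' daily counts
correl &amp;lt;- cor(counts)

# keep each pair of teams once (upper triangle), highest correlation first
idx &amp;lt;- which(upper.tri(correl), arr.ind = TRUE)
team_pairs &amp;lt;- data.frame(
  row    = rownames(correl)[idx[, "row"]],
  col    = colnames(correl)[idx[, "col"]],
  correl = correl[idx]
)
team_pairs[order(-team_pairs$correl), ]
&lt;/pre&gt;Teams that always submit on the same days (A and B in the toy data) end up at the top of the sorted list.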
&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;td.row             td.col       pairs     correl&lt;br /&gt;UCI-CS273A-RegAll  Alex_Tot        27  0.9979902&lt;br /&gt;rutgers            HappyAcura      29  0.9978254&lt;br /&gt;SD_John            lily            34  0.9974190&lt;br /&gt;Roger99            Krakozjabra     21  0.9956643&lt;br /&gt;SD_John            JYL             24  0.9950884&lt;br /&gt;The_Cuckoo's_Nest  NumberNinja     23  0.9931073&lt;br /&gt;NumberNinja        Chris_R         29  0.9924864&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;What is going on here?&lt;br /&gt;&lt;br /&gt;If you plot the submissions and scores you will see SD_John and Lily seem to perfectly track each other in the days they submit, the times they submit and the scores they get.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-ZdE8JOHPpCI/TukHK50OjKI/AAAAAAAAAH0/Bhxm1VcWpFA/s1600/opera%2Bsubmissions.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="246" width="320" src="http://1.bp.blogspot.com/-ZdE8JOHPpCI/TukHK50OjKI/AAAAAAAAAH0/Bhxm1VcWpFA/s320/opera%2Bsubmissions.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;And on one particular day they get exactly the same score within 5 minutes of each other...&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-BXsx1arTxiw/TukJTNSXx3I/AAAAAAAAAIA/10lpGm3uaLA/s1600/lily%2Bsd%2Bjohn.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="231" width="320" src="http://3.bp.blogspot.com/-BXsx1arTxiw/TukJTNSXx3I/AAAAAAAAAIA/10lpGm3uaLA/s320/lily%2Bsd%2Bjohn.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;SD_John and JYL seem to also track each other in submission dates. 
Interestingly JYL has a very similar profile to a member of Opera, and a little digging would suggest this is one and the same person. &lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-ma1AlWUIw7M/TukKsxRO5LI/AAAAAAAAAIo/4A3X5M-ifmo/s1600/Jeong3.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="282" width="320" src="http://1.bp.blogspot.com/-ma1AlWUIw7M/TukKsxRO5LI/AAAAAAAAAIo/4A3X5M-ifmo/s320/Jeong3.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-K3gC8XA4q7s/TukKsAb4XzI/AAAAAAAAAIM/zcLubqvvjRI/s1600/jeong1.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="215" width="320" src="http://4.bp.blogspot.com/-K3gC8XA4q7s/TukKsAb4XzI/AAAAAAAAAIM/zcLubqvvjRI/s320/jeong1.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-PygCFTtN5UM/TukKseYIqHI/AAAAAAAAAIc/6d43wFzO2E4/s1600/Jeong2.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="320" width="313" src="http://1.bp.blogspot.com/-PygCFTtN5UM/TukKseYIqHI/AAAAAAAAAIc/6d43wFzO2E4/s320/Jeong2.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;So here we can hypothesize that SD_john, lily, JYL and Opera (and evidence also suggests many more teams) are collaborating in some way.&lt;br /&gt;&lt;br /&gt;Interesting - all from following your nose, which is what good data mining is all about.&lt;br /&gt;&lt;br /&gt;In conclusion, the top of the leaderboard is not really what it appears to be - which I hope will encourage others to keep trying.&lt;br /&gt;&lt;br /&gt;The main reason for this investigation was to help me discover what R can do to manipulate data - and the answer is basically anything you want it to do. 
You first have to know what you want to achieve then do some Googling and you will find some code to help you somewhere.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7869239410549189343?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7869239410549189343/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/phantom-of-opera.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7869239410549189343'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7869239410549189343'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/12/phantom-of-opera.html' title='Phantom of the Opera'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-UaJMP6G4Vto/Tuj9rc0Gj8I/AAAAAAAAAG4/ioTd6GW7tfE/s72-c/first%2Bsubmission%2Bdates.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-1610424384668484942</id><published>2011-11-17T04:18:00.000-08:00</published><updated>2011-11-17T04:27:50.227-08:00</updated><title type='text'>The Pack is Catching Up</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, 
sans-serif;"&gt;If you have been keeping an eye on the leaderboard, then you will notice there has been apparently little activity since the milestone 1 deadline. On some occasions there has been no change in the top 40 positions for over one week. This is quite an eerie silence and I suspect there may be a lack of submissions so teams can merge (the total submissions of teams merging has to be less than the number of days the comp has been running).&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;There have also been some very interesting movements if you look closer - more to come on this in a later post.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;If you look further down though, the pack is catching up. In the past six weeks more teams are heading towards the 0.461 mark, which is the point the early leaders got to straight away and was the score to beat. Now it is only a top 50 place.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;This score is a good single model. To improve dramatically from there though it is probably necessary to ensemble various models. What is pretty clear though is that the benchmark of 0.40 for the 3 million is impossible (hopefully this might be adjusted?).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I tried to put some nice colours in the chart below, which is generated in R, but could not find any up to date listing of colour codes in R. 
This is one of the disadvantages of the open source movement - documentation is very low on the contributors' list of priorities (and what documentation there is leaves a lot to be desired if R is to be used by 'regular' types of people).&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I did find the following link though, which is where I got the colours for the plot,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://colorbrewer2.org/"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;http://colorbrewer2.org/&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;click on the image to enlarge&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-nce8044X0XE/TsT1a_3Q6KI/AAAAAAAAAGs/ViU1E_6V3a8/s1600/leaderboard+through+time+16+Nov.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-nce8044X0XE/TsT1a_3Q6KI/AAAAAAAAAGs/ViU1E_6V3a8/s320/leaderboard+through+time+16+Nov.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-1610424384668484942?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/1610424384668484942/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/11/pack-is-catching-up.html#comment-form' title='2 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/1610424384668484942'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/1610424384668484942'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/11/pack-is-catching-up.html' title='The Pack is Catching Up'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-nce8044X0XE/TsT1a_3Q6KI/AAAAAAAAAGs/ViU1E_6V3a8/s72-c/leaderboard+through+time+16+Nov.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-182895074757914860</id><published>2011-10-07T15:46:00.000-07:00</published><updated>2011-10-07T15:53:07.608-07:00</updated><title type='text'>If you can't measure it, you can't manage it!</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;"If you can't measure it, you can't manage it!" - this is often quoted in the marketing and analytics world.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;Apparently &lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;i&gt;"It comes from a remark attributed to GALILEO, who said 'Count what is&lt;br /&gt;countable, measure what is measurable. 
What is not measurable, make&lt;br /&gt;measurable'."&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;see &lt;a href="http://answers.google.com/answers/threadview?id=139473"&gt;here&lt;/a&gt; for more details on the origins of the quote.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;Anyway, the point is that if you don't measure the before and after, you have no idea if the actions made a difference.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;Last week in my &lt;a href="http://anotherdataminingblog.blogspot.com/2011/10/code-for-respectable-hhp-model.html"&gt;previous blog post&lt;/a&gt;, I made available some code that would result in a score of around 0.4635 on the HHP leaderboard. 
The hope was that others would be able to get this code to work and re-ignite interest in the HHP and solving the world's health problems.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;When I looked at the leaderboard today, it looked, on the face of it, as if this was happening, with a few high risers around the 0.4635 mark.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; font-family: Arial,Helvetica,sans-serif; text-align: center;"&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://3.bp.blogspot.com/-iqPUioEtqlo/To96nyuaBCI/AAAAAAAAAFo/fU1hk0LdcPc/s1600/leaderboard%2Bafter%2Bcode1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-iqPUioEtqlo/To96nyuaBCI/AAAAAAAAAFo/fU1hk0LdcPc/s320/leaderboard%2Bafter%2Bcode1.PNG" width="247" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;This was the after, but the before is not shown, so on this evidence there is a bit of a hole in the claim that the code was responsible for this change - we are only showing the part of the leaderboard that we want to - such movement could be going on all over the place.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: 
small;"&gt;Good old Jeff Moser at Kaggle has been hard at work making the leaderboard dynamic, so we can actually go back in time. Knowing this I modified a previous R script I posted to plot the leaderboard at 3 points in time, the day I released the code, and a week before and after (the new script can be found at the end of this post).&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;This resulted in the following 2 plots...&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; font-family: Arial,Helvetica,sans-serif; text-align: center;"&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://2.bp.blogspot.com/-jV4R9N6PfB4/To98uWALh3I/AAAAAAAAAFw/5qPCVxgiWeo/s1600/leaderboardthroughtime1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-jV4R9N6PfB4/To98uWALh3I/AAAAAAAAAFw/5qPCVxgiWeo/s320/leaderboardthroughtime1.png" width="320" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; font-family: Arial,Helvetica,sans-serif; text-align: center;"&gt;&lt;span style="font-size: small;"&gt;&lt;a href="http://4.bp.blogspot.com/-xzgPvf4uKBA/To9835pDMzI/AAAAAAAAAF4/HNInl72QVc0/s1600/leaderboardthroughtime2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-xzgPvf4uKBA/To9835pDMzI/AAAAAAAAAF4/HNInl72QVc0/s320/leaderboardthroughtime2.png" width="320" 
/&gt;&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;These make a much more convincing case that the code did make a difference. If you follow Eu Jin Lok on the first plot, you will see he has improved his score but actually gone backwards in rank over the two-week period. The second plot shows that the number of submissions needed to score 0.4635 has dropped.&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;Anyway, the point of this post is that data scientists have the power to make massive differences to the bottom lines of companies. I have been there and done it, but failed to get too excited as I am unassuming and knew exactly what the result would be because my models told me in advance. If analytics - and us analysts - are to be taken more seriously and given the recognition we deserve, we need to start banging our own drum more than we do. Those who get the recognition and rewards are not always those who do the best work, but those who shout the loudest!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;And here is the R script to generate those plots... 
&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;##########################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# An R function to plot the Kaggle &lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# leaderboard at different points in time&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;##########################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;plotKaggleLeaderboard &amp;lt;- function(theURL&lt;br /&gt;,theDates&lt;br /&gt;,myTeam&lt;br /&gt;,topX=100&lt;br /&gt;,title&lt;br /&gt;,plottype=1){&lt;br /&gt;    &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#this library does all the hard work&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#Windows users: to install this package see&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#&lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#and &lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/&lt;/span&gt;&lt;br /&gt;library(XML)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#one colour per date in theDates&lt;/span&gt;&lt;br /&gt;theColours &amp;lt;- c(552,254,26)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;for&lt;/span&gt; (i in 1:&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;(theDates))&lt;br /&gt;{&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#go and read the tables from the web page&lt;/span&gt;&lt;br /&gt;thisDate &amp;lt;- paste(theURL,&lt;span style="color: #2a00ff;"&gt;"?asOf="&lt;/span&gt;,theDates[i],sep=&lt;span style="color: 
#2a00ff;"&gt;""&lt;/span&gt;)&lt;br /&gt;tables &amp;lt;- readHTMLTable(thisDate)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#get the table of interest&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#names(tables)&lt;/span&gt;&lt;br /&gt;lb &amp;lt;- tables[[&lt;span style="color: #2a00ff;"&gt;'leaderboard-table'&lt;/span&gt;]]&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#see what the columns are&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#colnames(lb)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#nrow(lb)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#convert to numeric - see ?readHTMLTable&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#numeric_cols &amp;lt;- c('AUC','Entries')&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#numeric_cols &amp;lt;- c('RMSLE','Entries')&lt;/span&gt;&lt;br /&gt;numeric_cols &amp;lt;- c(4,5)&lt;br /&gt;lb[numeric_cols] = lapply(lb[numeric_cols], function(x) as.numeric(gsub(&lt;span style="color: #2a00ff;"&gt;".* "&lt;/span&gt;, &lt;span style="color: #2a00ff;"&gt;""&lt;/span&gt;, as.character(x))))&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#extract the team name from a messy field&lt;/span&gt;&lt;br /&gt;team_col &amp;lt;- c(&lt;span style="color: #2a00ff;"&gt;'Team Name'&lt;/span&gt;)&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#lb[team_col]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#split the field on "\r\n", which denotes the break between sub-fields, and keep the first piece &lt;/span&gt;&lt;br /&gt;lb[,team_col] &amp;lt;- sapply(strsplit(as.character(lb[,team_col]), &lt;span style="color: #2a00ff;"&gt;"[&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\r&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\n&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;]"&lt;/span&gt;), function (x) x[1]) &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;#####################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#      plot&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################&lt;/span&gt;&lt;br /&gt;myRank &amp;lt;- which(lb[team_col] == myTeam)&lt;br /&gt;myText = paste(&lt;span style="color: #2a00ff;"&gt;"following team"&lt;/span&gt;,myTeam)&lt;br /&gt;&lt;br /&gt;error_data &amp;lt;- lb[,4]&lt;br /&gt;entries_data &amp;lt;- lb[,5]&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(plottype==1){&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(i==1)  plot(error_data[1:topX],col = theColours[i],type=&lt;span style="color: #2a00ff;"&gt;"l"&lt;/span&gt;,xlab=&lt;span style="color: #2a00ff;"&gt;'Rank'&lt;/span&gt;,ylab=&lt;span style="color: #2a00ff;"&gt;'Error'&lt;/span&gt;,main = title, &lt;span style="color: #7f0055; font-weight: bold;"&gt;sub&lt;/span&gt;=myText)&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(i&amp;gt;1) points(error_data[1:topX],col = theColours[i],type=&lt;span style="color: #2a00ff;"&gt;"l"&lt;/span&gt;)&lt;br /&gt;          &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#mark position&lt;/span&gt;&lt;br /&gt;points(myRank,error_data[myRank],col=theColours[i],pch=19)&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(plottype==2){&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(i==1){ plot(error_data[1:topX],entries_data[1:topX],type = &lt;span style="color: #2a00ff;"&gt;'p'&lt;/span&gt;&lt;br /&gt;            ,main = title&lt;br /&gt;            , xlab = &lt;span style="color: #2a00ff;"&gt;'Error'&lt;/span&gt;&lt;br /&gt;            , ylab = &lt;span style="color: #2a00ff;"&gt;'Entries'&lt;/span&gt;&lt;br /&gt;            , col= theColours[i]&lt;br /&gt;    )}&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; 
font-weight: bold;"&gt;if&lt;/span&gt;(i&amp;gt;1) points(error_data[1:topX],entries_data[1:topX],type = &lt;span style="color: #2a00ff;"&gt;'p'&lt;/span&gt;, col= theColours[i])&lt;br /&gt;&lt;br /&gt;lines(lowess(error_data[1:topX],entries_data[1:topX]), col=theColours[i], lwd=2.5) &lt;span style="color: #3f7f59;"&gt;# lowess line (x,y)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#mark my position&lt;/span&gt;&lt;br /&gt;points(error_data[myRank],entries_data[myRank],col=theColours[i],pch=19)&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;legend(&lt;span style="color: #2a00ff;"&gt;"topleft"&lt;/span&gt;,legend=theDates,text.col=theColours)&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# End of Function&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# plotKaggleLeaderboard&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;compURL &amp;lt;- &lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://www.heritagehealthprize.com/c/hhp/Leaderboard&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;br /&gt;compDates &amp;lt;- c(&lt;span style="color: #2a00ff;"&gt;'2011-09-24'&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;'2011-10-01'&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;'2011-10-07'&lt;/span&gt;)&lt;br /&gt;compTeam &amp;lt;- &lt;span style="color: #2a00ff;"&gt;'Eu Jin Lok'&lt;/span&gt;&lt;br /&gt;compTopX &amp;lt;- 150&lt;br /&gt;compTitle &amp;lt;- &lt;span style="color: #2a00ff;"&gt;'HHP Leaderboard Through Time'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;plotKaggleLeaderboard(theURL = compURL&lt;br /&gt;            ,theDates = compDates&lt;br /&gt;                      ,myTeam = compTeam&lt;br /&gt;                      ,topX = compTopX&lt;br /&gt; 
                     ,title = compTitle&lt;br /&gt;                       ,plottype=1&lt;br /&gt;                      )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/182895074757914860/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/10/if-you-cant-measure-it-you-cant-manage.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/182895074757914860'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/182895074757914860'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/10/if-you-cant-measure-it-you-cant-manage.html' title='If you can&apos;t measure it, you can&apos;t manage it!'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-iqPUioEtqlo/To96nyuaBCI/AAAAAAAAAFo/fU1hk0LdcPc/s72-c/leaderboard%2Bafter%2Bcode1.PNG' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8486205815416124902</id><published>2011-10-01T06:43:00.000-07:00</published><updated>2011-10-01T06:55:37.909-07:00</updated><title type='text'>Code for a Respectable HHP Model</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;Our team Market Makers did well in the HHP first milestone prize, taking out top spot just above Willem Mestrom. Congratulations to all who took part.&lt;br /&gt;&lt;br /&gt;For those of you who haven't seen the presentation, there is a recording at&amp;nbsp;&lt;a href="http://www.youtube.com/watch?v=xziMY3nDB_E"&gt;http://www.youtube.com/watch?v=xziMY3nDB_E&lt;/a&gt; &lt;br /&gt;&lt;br /&gt;Reports were written on our methods and are available here... &lt;br /&gt;&lt;a href="http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1"&gt;http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1&lt;/a&gt; &lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;Other competitors now have 30 days to read the reports and come back with any concerns.&lt;br /&gt;&lt;br /&gt;In our report we included some source code, which is a bit of a pain to copy and paste, so I have reproduced it here. There are two bits of code to run, and at the end a file ready for submission should appear.&lt;br /&gt;&lt;br /&gt;The first bit is the data preparation which I did in SQL. I used SQL Server, but it should be easy to run it in any SQL database without much messing around. 
This took about 8 minutes to run, and at the end is a table ready for use by the second part, which is the modelling performed in R, which takes about 15 minutes.&lt;br /&gt;&lt;br /&gt;There are two modifications to the code to make - in the SQL you need to set the correct path to the source files - and in R you need to put in the correct server name so the data can be read directly from the database.&lt;br /&gt;&lt;br /&gt;So hopefully in two mouse clicks there will be a file sitting there ready for submission.&lt;br /&gt;&lt;br /&gt;If you are not sure about using SQL Server, then the earlier blog posts should point you in the right direction...&lt;br /&gt;&lt;br /&gt;&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html"&gt;http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note that the SQL provided does not generate all the variables we used - others were created, all using SQL, but the code was a little more complicated. The extra variables are described in our write-up. Also the parameters used in the GBM modelling are not the most efficient. You will need more trees and a lower shrinkage factor to tease out a better model (see the R documentation) - but this one runs in a reasonable time.&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;I am keen to see if this code is useful to anybody - if you can't get it to work then please post a comment here. 
If you do use it then I would also like to hear from you!&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;SQL Code&lt;/div&gt;&lt;br /&gt;&lt;div style="border-color: black; border-style: solid; border-width: 1px; height: 300px; overflow: auto; width: 500px;"&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/**************************************************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* SQL Code to create an example data set for the HHP&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;*&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* Edit the path in the 'bulk insert' commands to locate&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* the source data&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* The end result is a table called 'modelling_set' which can &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* then be used to build predictive models&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* created in SQL server express&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;* http://www.microsoft.com/sqlserver/en/us/editions/express.aspx&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;*****************************************************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/**************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;create a new database&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;**************************/&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-size: xx-small;"&gt;CREATE DATABASE HHP_comp&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;USE HHP_comp&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/**************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;load in the raw data&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;**************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;--claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID VARCHAR(8)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, ProviderID VARCHAR(7)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, Vendor VARCHAR(6)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, PCP VARCHAR(5)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, Year VARCHAR(2)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, Specialty VARCHAR(25)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, PlaceSvc VARCHAR(19)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, PayDelay VARCHAR(4)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, LengthOfStay VARCHAR(10)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DSFS 
VARCHAR(12)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, PrimaryConditionGroup VARCHAR(8)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, CharlsonIndex VARCHAR(3)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, ProcedureGroup VARCHAR(4)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, SupLOS TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release2\Claims.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;--members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID_M VARCHAR(8)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, AgeAtFirstClaim VARCHAR(5)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, Sex VARCHAR(1)&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release2\Members.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- drug count&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE DrugCount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, Year VARCHAR(2)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DSFS VARCHAR(12)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DrugCount VARCHAR(2)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT DrugCount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release3\DrugCount.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- Lab Count&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE LabCount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, Year VARCHAR(2)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DSFS VARCHAR(12)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, LabCount VARCHAR(3)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT LabCount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release3\LabCount.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;--DaysInHospital_Y2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE DaysInHospital_Y2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, ClaimsTruncated TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DaysInHospital TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT DaysInHospital_Y2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y2.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- DaysInHospital_Y3 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE DaysInHospital_Y3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, ClaimsTruncated TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DaysInHospital TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT DaysInHospital_Y3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y3.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- Target&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CREATE TABLE Target&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID INT&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;, ClaimsTruncated TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;, DaysInHospital TINYINT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;BULK INSERT Target&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM 'F:\comps\hhp\raw data\HHP_release2\Target.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MAXERRORS = 0,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIRSTROW = 2,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FIELDTERMINATOR = ',',&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ROWTERMINATOR = '\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/*************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;adjust the claims data to&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;convert text to integers&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;**************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- PayDelay&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;ALTER TABLE Claims &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD PayDelayI integer&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET PayDelayI = CASE WHEN PayDelay = '162+' THEN 162 ELSE CAST(PayDelay AS integer) END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;--dsfs&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Claims &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD dsfsI integer&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET dsfsI =&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '0- 1 month' THEN 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '1- 2 months' THEN 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '2- 3 months' THEN 3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '3- 4 months' THEN 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '4- 5 months' THEN 5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '5- 6 months' THEN 6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '6- 7 months' THEN 7&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;WHEN dsfs = '7- 8 months' THEN 8&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '8- 9 months' THEN 9&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '9-10 months' THEN 10&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '10-11 months' THEN 11&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs = '11-12 months' THEN 12&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN dsfs IS NULL THEN NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- CharlsonIndex&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Claims &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD CharlsonIndexI INTEGER&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET CharlsonIndexI =&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN CharlsonIndex = '0' THEN 0 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN CharlsonIndex = '1-2' THEN 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN CharlsonIndex = '3-4' THEN 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN CharlsonIndex = '5+' THEN 6 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- LengthOfStay&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Claims &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD LengthOfStayI INTEGER&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET LengthOfStayI =&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '1 day'  THEN 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '2 days' THEN 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '3 days' THEN 3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '4 days' THEN 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '5 days' THEN 5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '6 days' THEN 6 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '1- 2 weeks' THEN 11&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '2- 4 weeks' THEN 21&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '4- 8 weeks' THEN 42&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay = '26+ weeks' THEN 180&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN LengthOfStay IS NULL THEN null&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/**************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;create a summary table &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;at the member/year level&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;***************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;year&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,Memberid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(*) AS no_Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT ProviderID) AS no_Providers&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT Vendor) AS no_Vendors&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT PCP) AS no_PCPs&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT PlaceSvc) AS no_PlaceSvcs&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT Specialty) AS no_Specialities&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT PrimaryConditionGroup) AS no_PrimaryConditionGroups&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(DISTINCT ProcedureGroup) AS no_ProcedureGroups&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(PayDelayI) AS PayDelay_max&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MIN(PayDelayI) AS PayDelay_min&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,AVG(PayDelayI * 1.0) AS PayDelay_ave&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(PayDelayI) END) AS PayDelay_stdev&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(LengthOfStayI) AS LOS_max&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MIN(LengthOfStayI) AS LOS_min&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,AVG(LengthOfStayI * 1.0) AS LOS_ave&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(LengthOfStayI) END) AS LOS_stdev&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 0 THEN 1 ELSE 0 END) AS LOS_TOT_UNKNOWN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 1 THEN 1 ELSE 0  END) AS LOS_TOT_SUPRESSED &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN LENGTHOFSTAY IS NOT NULL THEN 1 ELSE 0 END) AS LOS_TOT_KNOWN &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(dsfsI) AS dsfs_max&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MIN(dsfsI) AS dsfs_min&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(dsfsI) - MIN(dsfsI) AS dsfs_range&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,AVG(dsfsI * 1.0) AS dsfs_ave&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(dsfsI) END) AS dsfs_stdev&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(CharlsonIndexI) AS CharlsonIndexI_max&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MIN(CharlsonIndexI) AS 
CharlsonIndexI_min&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,AVG(CharlsonIndexI * 1.0) AS CharlsonIndexI_ave&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(CharlsonIndexI) - MIN(CharlsonIndexI) AS CharlsonIndexI_range&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(CharlsonIndexI) END) AS CharlsonIndexI_stdev&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'MSC2a3' THEN 1 ELSE 0 END) AS pcg1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'METAB3' THEN 1 ELSE 0 END) AS pcg2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'ARTHSPIN' THEN 1 ELSE 0 END) AS pcg3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'NEUMENT' THEN 1 ELSE 0 END) AS pcg4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'RESPR4' THEN 1 ELSE 0 END) AS pcg5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'MISCHRT' THEN 1 ELSE 0 END) AS pcg6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'SKNAUT' THEN 1 ELSE 0 END) AS pcg7&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'GIBLEED' THEN 1 ELSE 0 END) AS pcg8&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'INFEC4' THEN 1 ELSE 0 END) AS pcg9&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'TRAUMA' THEN 1 ELSE 0 END) AS pcg10&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'HEART2' THEN 1 ELSE 0 END) AS pcg11&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL3' THEN 1 ELSE 0 END) AS pcg12&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'ROAMI' THEN 1 ELSE 0 END) AS pcg13&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL5' THEN 1 ELSE 0 END) AS pcg14&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'ODaBNCA' THEN 1 ELSE 0 END) AS pcg15&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'UTI' THEN 1 ELSE 0 END) AS pcg16&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'COPD' THEN 1 ELSE 0 END) AS pcg17&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'GYNEC1' THEN 1 ELSE 0 END) AS pcg18&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRB' THEN 1 ELSE 0 END) AS pcg19&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'FXDISLC' THEN 1 ELSE 0 END) AS pcg20&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'AMI' THEN 1 ELSE 0 END) AS pcg21&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'PRGNCY' THEN 1 ELSE 0 END) AS pcg22&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'HEMTOL' THEN 1 ELSE 0 END) AS pcg23&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'HEART4' THEN 1 ELSE 0 END) AS pcg24&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'SEIZURE' THEN 1 ELSE 0 END) AS pcg25&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'APPCHOL' THEN 1 ELSE 0 END) AS pcg26&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'CHF' THEN 1 ELSE 0 END) AS pcg27&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'GYNECA' THEN 1 ELSE 0 END) AS pcg28&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup IS NULL  THEN 1 ELSE 0 END) AS pcg29&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'PNEUM' THEN 1 ELSE 0 END) AS pcg30&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL2' THEN 1 ELSE 0 END) AS pcg31&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'GIOBSENT' THEN 1 ELSE 0 END) AS pcg32&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'STROKE' THEN 1 ELSE 0 END) AS pcg33&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRA' THEN 1 ELSE 0 END) AS pcg34&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'FLaELEC' THEN 1 ELSE 0 END) AS pcg35&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL1' THEN 1 ELSE 0 END) AS pcg36&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'HIPFX' THEN 1 ELSE 0 END) AS pcg37&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'METAB1' THEN 1 ELSE 0 END) AS pcg38&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE 
WHEN PrimaryConditionGroup = 'PERVALV' THEN 1 ELSE 0 END) AS pcg39&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'LIVERDZ' THEN 1 ELSE 0 END) AS pcg40&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'CATAST' THEN 1 ELSE 0 END) AS pcg41&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRM' THEN 1 ELSE 0 END) AS pcg42&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'PERINTL' THEN 1 ELSE 0 END) AS pcg43&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'PNCRDZ' THEN 1 ELSE 0 END) AS pcg44&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL1' THEN 1 ELSE 0 END) AS pcg45&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PrimaryConditionGroup = 'SEPSIS' THEN 1 ELSE 0 END) AS pcg46&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Internal' THEN 1 ELSE 0 END) AS sp1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Laboratory' THEN 1 ELSE 0 END) AS sp2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'General Practice' THEN 1 ELSE 0 END) AS sp3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Surgery' THEN 1 ELSE 0 END) AS sp4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Diagnostic Imaging' THEN 1 ELSE 0 END) AS sp5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Emergency' THEN 1 ELSE 0 END) AS sp6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Other' THEN 1 ELSE 0 END) AS sp7&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Pediatrics' THEN 1 ELSE 0 END) AS sp8&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Rehabilitation' THEN 1 ELSE 0 END) AS sp9&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Obstetrics and Gynecology' THEN 1 ELSE 0 END) AS sp10&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Anesthesiology' THEN 1 ELSE 0 END) AS sp11&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty = 'Pathology' THEN 1 ELSE 0 END) AS sp12&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN Specialty IS NULL THEN 1 ELSE 0 END) AS sp13&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'EM' THEN 1 ELSE 0 END  )  AS  pg1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'PL' THEN 1 ELSE 0 END  )  AS  pg2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'MED' THEN 1 ELSE 0 END  )  AS  pg3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SCS' THEN 1 ELSE 0 END  )  AS  pg4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'RAD' THEN 1 ELSE 0 END  )  AS  pg5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SDS' THEN 1 ELSE 0 END  )  AS  pg6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SIS' THEN 1 ELSE 0 END  )  AS  pg7&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SMS' THEN 1 
ELSE 0 END  )  AS  pg8&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'ANES' THEN 1 ELSE 0 END  )  AS  pg9&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SGS' THEN 1 ELSE 0 END  )  AS  pg10&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SEOA' THEN 1 ELSE 0 END  )  AS  pg11&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SRS' THEN 1 ELSE 0 END  )  AS  pg12&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SNS' THEN 1 ELSE 0 END  )  AS  pg13&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SAS' THEN 1 ELSE 0 END  )  AS  pg14&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SUS' THEN 1 ELSE 0 END  )  AS  pg15&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup IS NULL  THEN 1 ELSE 0 END  )  AS  pg16&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SMCD' THEN 1 ELSE 0 END  )  AS  pg17&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN ProcedureGroup = 'SO' THEN 1 ELSE 0 END  )  AS  pg18&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Office' THEN 1 ELSE 0 END) AS ps1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Independent Lab' THEN 1 ELSE 0 END) AS ps2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Urgent Care' THEN 1 ELSE 0 END) AS ps3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Outpatient Hospital' THEN 1 ELSE 0 END) AS ps4&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Inpatient Hospital' THEN 1 ELSE 0 END) AS ps5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Ambulance' THEN 1 ELSE 0 END) AS ps6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Other' THEN 1 ELSE 0 END) AS ps7&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc = 'Home' THEN 1 ELSE 0 END) AS ps8&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,SUM(CASE WHEN PlaceSvc IS NULL THEN 1 ELSE 0 END) AS ps9&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM Claims&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GROUP BY  year,Memberid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- remove some nulls&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET LOS_max = 0 WHERE LOS_max IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET LOS_min = 0 WHERE LOS_min IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET LOS_ave = 0 WHERE LOS_ave IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET LOS_stdev = -1 WHERE LOS_stdev IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET dsfs_max = 0 WHERE dsfs_max IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET dsfs_min = 0 WHERE dsfs_min IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET dsfs_ave = 0 WHERE dsfs_ave IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET dsfs_stdev = -1 WHERE dsfs_stdev IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET dsfs_range = -1 WHERE dsfs_range IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE claims_per_member&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET CharlsonIndexI_range = -1 WHERE CharlsonIndexI_range IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/***********************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;***********************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- create binary flags for age&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_05 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_15 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_25 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_35 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_45 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_55 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_65 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_75 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_85 INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members ADD age_MISS INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_05 = CASE WHEN ageATfirstclaim = '0-9' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_15 = CASE WHEN ageATfirstclaim 
= '10-19' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_25 = CASE WHEN ageATfirstclaim = '20-29' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_35 = CASE WHEN ageATfirstclaim = '30-39' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_45 = CASE WHEN ageATfirstclaim = '40-49' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_55 = CASE WHEN ageATfirstclaim = '50-59' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_65 = CASE WHEN ageATfirstclaim = '60-69' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_75 = CASE WHEN ageATfirstclaim = '70-79' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_85 = CASE WHEN ageATfirstclaim = '80+' THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members SET age_MISS = CASE WHEN ageATfirstclaim IS NULL THEN 1 ELSE 0 END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;--create binary flags for sex&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD sexMALE INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET SexMALE = &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;CASE&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN Sex = 'M' THEN 1 ELSE 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD sexFEMALE INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET SexFEMALE = &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN Sex = 'F' THEN 1 ELSE 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ADD sexMISS INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE Members&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET SexMISS = &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;WHEN Sex IS NULL THEN 1 ELSE 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/******************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;DRUG COUNTS&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;******************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- convert to integers&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE drugcount ADD DrugCountI INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE DRUGCOUNT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET DrugCountI = &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE WHEN DrugCount = '7+' THEN 7 ELSE DrugCount END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;memberID AS memberID_dc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,Year AS YEAR_dc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(drugcountI) AS drugCount_max&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MIN(drugcountI) AS drugCount_min&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,AVG(drugcountI * 1.0) AS drugCount_ave&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(*) AS drugcount_months&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO DRUGCOUNT_SUMMARY&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;FROM&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;drugcount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GROUP BY &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;memberID&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,Year&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/******************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;LAB COUNTS&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;******************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- convert to integers&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE LabCount ADD LabCountI INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE LabCount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SET LabCountI = &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;CASE WHEN LabCount = '10+' THEN 10 ELSE LabCount END&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;memberID AS memberID_lc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,Year AS YEAR_lc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MAX(labcountI) AS labCount_max&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,MIN(labcountI) AS labCount_min&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;,AVG(labcountI * 1.0) AS labCount_ave&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,COUNT(*) AS labcount_months&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO LABCOUNT_SUMMARY&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;labcount&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GROUP BY &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;memberID&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,Year&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/********************************&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;Targets&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;********************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT * &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO DIH&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;(&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID AS MemberID_t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,'Y1' AS YEAR_t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,ClaimsTruncated&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,DaysInHospital &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,1 AS trainset&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM DaysInHospital_Y2&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UNION ALL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID AS MemberID_t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,'Y2' AS YEAR_t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,ClaimsTruncated&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,DaysInHospital &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,1 AS trainset&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM DaysInHospital_Y3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UNION ALL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;MemberID AS MemberID_t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,'Y3' AS YEAR_t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,ClaimsTruncated&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,null AS DaysInHospital&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;,0 AS trainset&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM Target&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;) a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;/*****************************&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-size: xx-small;"&gt;Now merge them all together to&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;create the modeling data SET&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;******************************/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT a.*,b.*&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO #temp1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;DIH a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;LEFT OUTER JOIN &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;members b&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;on a.MemberID_t = B.Memberid_M&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp1 DROP COLUMN Memberid_M&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp1 DROP COLUMN AgeAtFirstClaim&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp1 DROP COLUMN Sex&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT a.*,b.*&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO #temp2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;#temp1 a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;LEFT OUTER JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;claims_per_member 
b&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;on a.MemberID_t = B.Memberid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;AND a.YEAR_t = b.year&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp2 DROP COLUMN Memberid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp2 DROP COLUMN year&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT a.*,b.*&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO #temp3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;#temp2 a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;LEFT OUTER JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;DRUGCOUNT_SUMMARY b&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;on a.MemberID_t = B.Memberid_dc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;AND a.YEAR_t = b.YEAR_dc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp3 DROP COLUMN Memberid_dc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp3 DROP COLUMN YEAR_dc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: 
xx-small;"&gt;SELECT a.*,b.*&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO #temp4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;#temp3 a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;LEFT OUTER JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;LABCOUNT_SUMMARY b&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;on a.MemberID_t = B.Memberid_lc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;AND a.YEAR_t = b.YEAR_lc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp4 DROP COLUMN Memberid_lc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp4 DROP COLUMN YEAR_lc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- removel nulls for those who had &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;-- no lab or drug information&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp4 ADD labNull INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;ALTER TABLE #temp4 ADD drugNull INT&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;GO&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET labNull = 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET labNull = 1 WHERE labCount_max IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET drugNull = 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET drugNull = 1 WHERE drugCount_max IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET labCount_max = 0 WHERE labCount_max IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET labCount_min = 0 WHERE labCount_min IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET labCount_ave = 0 WHERE labCount_ave IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET labcount_months = 0 WHERE labcount_months IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET drugCount_max = 0 WHERE drugCount_max IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET drugCount_min = 0 WHERE drugCount_min IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET drugCount_ave = 0 WHERE drugCount_ave IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;UPDATE #temp4 SET drugcount_months = 0 WHERE drugcount_months IS NULL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;SELECT * &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;INTO modelling_set&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;FROM #temp4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;R code&lt;br /&gt;&lt;br /&gt;&lt;div style="border-color: black; border-style: solid; border-width: 1px; height: 300px; overflow: auto; width: 500px;"&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;pre style="background: none repeat scroll 0% 0% rgb(255, 255, 255); color: black;"&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# Example GBM model for HHP&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# scores ~ 0.4635 on leaderboard&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# which would be 55th position of 510&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# as at 9th Sept 2011&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# Requires the data having been prepared&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# using the SQL supplied&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;starttime 
&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; proc&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="color: #400000; font-size: xx-small;"&gt;time&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#load the data&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;library&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;RODBC&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#set a connection to the database &lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;conn &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; odbcDriverConnect&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;"driver=SQL Server;database=HHP_comp;server=servername&lt;/span&gt;&lt;span style="color: #0f69ff; 
font-size: xx-small;"&gt;\\&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;SQLEXPRESS;"&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#or this method involves setting up a DSN (Data Source Name) called HHP_comp&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#conn &amp;lt;- odbcConnect("HHP_comp")&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;alldata &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; sqlQuery&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;conn&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;"select * from modelling_set"&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# arrange the data&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#identify train and leaderboard data&lt;/span&gt;&lt;span 
style="font-size: xx-small;"&gt;&lt;br /&gt;trainrows &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; which&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;alldata$trainset &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;1&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;scorerows &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; which&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;alldata$trainset &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;0&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#sanity check the size of each set&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #400000; font-size: xx-small;"&gt;length&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;trainrows&lt;/span&gt;&lt;span 
style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #400000; font-size: xx-small;"&gt;length&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;scorerows&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#display the column names&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;colnames&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#memberid is required as key for submission set&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;memberid &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;[&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;scorerows&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;'MemberID_t'&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;]&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#remove redundant fields&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;alldata$MemberID_t &lt;/span&gt;&lt;span style="color: #808030; font-size: 
xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; NULL&lt;br /&gt;alldata$YEAR_t &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; NULL&lt;br /&gt;alldata$trainset &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; NULL&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#target - what we are predicting&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;theTarget &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;'DaysInHospital'&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#put the target on the log scale&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;[&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;trainrows&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;theTarget&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;]&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; log1p&lt;/span&gt;&lt;span 
style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;[&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;trainrows&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;theTarget&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;]&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#find the position of the target&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;targindex &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;  which&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;names&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;theTarget&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;# build the model&lt;/span&gt;&lt;span style="font-size: 
xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#GBM model settings, these can be varied&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;GBM_NTREES &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;500&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;GBM_SHRINKAGE &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;0.&lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;05&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;GBM_DEPTH &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;4&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;GBM_MINOBS &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;50&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#build the GBM model&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;library&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;gbm&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br 
/&gt;GBM_model &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; gbm&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;fit&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;             &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;x&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;[&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;trainrows&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;targindex&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;]&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;            &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="color: maroon; font-size: xx-small; font-weight: bold;"&gt;y &lt;/span&gt;&lt;span style="color: maroon; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt; alldata[trainrows,targindex]&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;,distribution &lt;/span&gt;&lt;span style="color: maroon; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt; 
"gaussian"&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;,n.trees &lt;/span&gt;&lt;span style="color: maroon; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; GBM_NTREES&lt;br /&gt;            &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;shrinkage &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; GBM_SHRINKAGE&lt;br /&gt;            &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;interaction&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;depth &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; GBM_DEPTH&lt;br /&gt;            &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;n&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;minobsinnode &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; GBM_MINOBS&lt;br /&gt;            &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;verbose &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; TRUE&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;br /&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#list variable importance&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;summary&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;GBM_model&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;GBM_NTREES&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#predict for the leaderboard data&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;prediction &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; predict&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;gbm&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;object &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; GBM_model&lt;br /&gt;              &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;newdata &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; alldata&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;[&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;scorerows&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: 
xx-small;"&gt;targindex&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;]&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;              &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;GBM_NTREES&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#put on correct scale and cap&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;prediction &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; expm1&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;prediction&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;prediction &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; pmin&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;15&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;prediction&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;prediction &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; pmax&lt;/span&gt;&lt;span 
style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;0&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;prediction&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#plot the submission distribution&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;hist&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;prediction&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; breaks&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="color: #008c00; font-size: xx-small;"&gt;500&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;#write the submission to file&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: dimgrey; font-size: xx-small;"&gt;########################################&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;submission &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; cbind&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: 
xx-small;"&gt;memberid&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;prediction&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;colnames&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;submission&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; c&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;"MemberID"&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;"DaysInHospital"&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;fnname &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;"C:&lt;/span&gt;&lt;span style="color: #0f69ff; font-size: xx-small;"&gt;\\&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;GBM_demo1.csv"&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #400000; font-size: xx-small;"&gt;write&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;csv&lt;/span&gt;&lt;span style="color: #808030; font-size: 
xx-small;"&gt;(&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;submission&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; file&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;fnname&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; row&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;names &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;=&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; FALSE&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;br /&gt;elapsedtime &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;&amp;lt;&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; proc&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;span style="color: #400000; font-size: xx-small;"&gt;time&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; &lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;-&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt; starttime&lt;br /&gt;cat&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;(&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;"&lt;/span&gt;&lt;span style="color: #0f69ff; font-size: xx-small;"&gt;\n&lt;/span&gt;&lt;span style="color: #0000e6; font-size: xx-small;"&gt;Finished&lt;/span&gt;&lt;span style="color: #0f69ff; font-size: xx-small;"&gt;\n&lt;/span&gt;&lt;span style="color: 
#0000e6; font-size: xx-small;"&gt;"&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;,&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;elapsedtime&lt;/span&gt;&lt;span style="color: #808030; font-size: xx-small;"&gt;)&lt;/span&gt;&lt;span style="font-size: xx-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8486205815416124902?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8486205815416124902/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/10/code-for-respectable-hhp-model.html#comment-form' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8486205815416124902'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8486205815416124902'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/10/code-for-respectable-hhp-model.html' title='Code for a Respectable HHP Model'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-4423309138185632291</id><published>2011-08-31T23:30:00.000-07:00</published><updated>2011-09-01T00:16:39.299-07:00</updated><title type='text'>First Milestone</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div 
style="font-family: Arial,Helvetica,sans-serif;"&gt;The time has now been and gone for submitting an entry for the first milestone, and the chance to win either $30,000 or $20,000 for the top two places.&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;The images below are the state of play (on the 30% visible data!) just after the first milestone ( I didn't capture this image in time so some teams have made an extra submission since).&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;There was a bit of movement in the last 2 days.&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; Tom &amp;amp; Jerry changed their name to reveal themselves (Petterson &amp;amp; Caetano). This was interesting for me as they are from a relatively new machine learning lab here in Australia that I had not been aware of. They jumped up to 3rd position before the last submission but were overhauled at the last minute&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&amp;nbsp;Willem Mestrom made a big leap on his last submission to get into 3rd position. He was obviously keeping his best (or was he?) until the very last moment. 
This might be a good tactical play if he is looking towards the $3 million rather than the short-term smaller prizes.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Planet Melbourne, after making consistent daily submissions throughout the competition, did not take the opportunity to make one last submission.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;It is good to see Chris R still in the top 10. Chris has been up there since the very start. Interestingly, Chris was cut off from submitting for a while due to Hurricane Irene taking his power away.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&amp;nbsp;Interestingly, positions 2, 3 &amp;amp; 4 are now taken by members of &lt;a href="http://www.the-ensemble.com/"&gt;The Ensemble&lt;/a&gt; from Netflix fame. Position 1 contains Dave, who was in an original team called Ensemble Team, and also entered Netflix as a member of &lt;a href="http://www.findmorefives.com/"&gt;Ensemble Experts&lt;/a&gt;&lt;/span&gt;,&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; and I (also a member of Market Makers) ran a competition on &lt;a href="http://www.tiberius.biz/ausdm09/index.html"&gt;Ensembling the Netflix Data&lt;/a&gt; and have looked at &lt;a href="http://www.tiberius.biz/pakdd07.html"&gt;ensembling other competitions&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;There seems to be a pattern here!&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&amp;nbsp;&lt;/span&gt;&lt;ul style="text-align: left;"&gt;&lt;/ul&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;Note you can click on the images to enlarge them.&lt;/div&gt;&lt;div style="font-family: 
Arial,Helvetica,sans-serif;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-8MAePMCK-Yk/Tl8flhUO7WI/AAAAAAAAAFU/FeoesncWsSw/s1600/leaderboard_after_milestone.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="213" src="http://4.bp.blogspot.com/-8MAePMCK-Yk/Tl8flhUO7WI/AAAAAAAAAFU/FeoesncWsSw/s320/leaderboard_after_milestone.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-8gYxLbGW1j4/Tl8fvHyG6gI/AAAAAAAAAFY/4UCptzbrejQ/s1600/leaderboardrank.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-8gYxLbGW1j4/Tl8fvHyG6gI/AAAAAAAAAFY/4UCptzbrejQ/s320/leaderboardrank.PNG" width="310" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-rlemw2XyLnk/Tl8f20RkUOI/AAAAAAAAAFc/22Pzf_ZXpmg/s1600/leaderboardranktop100.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="318" src="http://2.bp.blogspot.com/-rlemw2XyLnk/Tl8f20RkUOI/AAAAAAAAAFc/22Pzf_ZXpmg/s320/leaderboardranktop100.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;The plot below shows the Lowess line fit of the error as a function of the number of entries submitted.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-QZuedvKsv30/Tl8f9S_N-OI/AAAAAAAAAFg/-Gjzvn7zX3o/s1600/leaderboardloess.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="318" 
src="http://2.bp.blogspot.com/-QZuedvKsv30/Tl8f9S_N-OI/AAAAAAAAAFg/-Gjzvn7zX3o/s320/leaderboardloess.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-4423309138185632291?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/4423309138185632291/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/first-milestone.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/4423309138185632291'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/4423309138185632291'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/first-milestone.html' title='First Milestone'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-8MAePMCK-Yk/Tl8flhUO7WI/AAAAAAAAAFU/FeoesncWsSw/s72-c/leaderboard_after_milestone.PNG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8599729805896552467</id><published>2011-08-27T22:29:00.000-07:00</published><updated>2011-08-27T22:30:13.226-07:00</updated><title type='text'>First Milestone Just Around the Corner</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span 
style="font-family: Arial,Helvetica,sans-serif;"&gt;There are about 4 days to go until the deadline for the first milestone, so I thought it would be interesting to capture the leaderboard now and again in 4 days, to see if things change wildly. At this stage it can be like a game of poker.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;All the top 10 teams are still in the hunt and making daily submissions. Tom &amp;amp; Jerry were late starters but have made astonishing progress to 3rd place. They currently split 3 others who were part of &lt;a href="http://www.the-ensemble.com/"&gt;'The Ensemble'&lt;/a&gt;, Edward, Willem and B.Yang.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;So many of the teams at the top have a proven history in data mining competitions. I will be very happy if those ensemble dudes repeat their performance in the Netflix Prize ;-). &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-zcYw4wYPdbM/TlnN1zS5PxI/AAAAAAAAAFQ/EuP4B3_2Z6o/s1600/leaderboard4daystogo.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="214" src="http://4.bp.blogspot.com/-zcYw4wYPdbM/TlnN1zS5PxI/AAAAAAAAAFQ/EuP4B3_2Z6o/s320/leaderboard4daystogo.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8599729805896552467?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8599729805896552467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://anotherdataminingblog.blogspot.com/2011/08/first-milestone-just-around-te-corner.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8599729805896552467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8599729805896552467'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/first-milestone-just-around-te-corner.html' title='First Milestone Just Around the Corner'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-zcYw4wYPdbM/TlnN1zS5PxI/AAAAAAAAAFQ/EuP4B3_2Z6o/s72-c/leaderboard4daystogo.PNG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7968671182834638789</id><published>2011-08-19T03:34:00.000-07:00</published><updated>2011-08-19T03:38:11.227-07:00</updated><title type='text'>1 minute of fame</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The HHP competition and Kaggle are catching the attention of the media over here in Australia.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I was on TV recently (and yes, it does add a few Kg to you),&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.abc.net.au/catalyst/stories/3296837.htm"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;http://www.abc.net.au/catalyst/stories/3296837.htm&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;and in the paper (the 2nd last paragraph is interesting!),&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.smh.com.au/business/miners-of-a-different-kind-in-line-for-riches-of-their-own-20110731-1i6bi.html"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;http://www.smh.com.au/business/miners-of-a-different-kind-in-line-for-riches-of-their-own-20110731-1i6bi.html&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Blog posts may be limited in the next 2 weeks as it's nose down to try and catch Edward.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7968671182834638789?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7968671182834638789/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://anotherdataminingblog.blogspot.com/2011/08/1-minute-of-fame.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7968671182834638789'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7968671182834638789'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/1-minute-of-fame.html' title='1 minute of fame'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-4386150124485533316</id><published>2011-08-03T09:02:00.000-07:00</published><updated>2011-08-03T09:04:49.232-07:00</updated><title type='text'>Bath Mawr Coch</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;In &lt;/span&gt;&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html" style="font-family: Arial,Helvetica,sans-serif;"&gt;one of my first posts&lt;/a&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; you will see how my eagerness to gloat of my success unleashed Dave into the HHP. He has subsequently been at the top of the leaderboard from the beginning, whereas my progress was more gradual. 
Eventually we seemed to converge to a similar score and even held positions 1 &amp;amp; 2 for a while, which was very surprising to me given the caliber of contestants that this competition has attracted.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;We have now decided to join forces, hence the appearance of team Bath Mawr Coch. Hopefully there will be some synergy in our methods.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Dave and I are both competition junkies. We first came into contact as a result of the &lt;a href="http://www.sigkdd.org/kddcup/index.php?section=2004&amp;amp;method=result"&gt;KDDCup 2004 contest&lt;/a&gt;, which Dave won - and you will see my effort in a lowly 11th place. The leaderboard was left open after the conclusion of the contest, and &lt;a href="http://osmot.cs.cornell.edu/cgi-bin/newtable.pl?prob=phy"&gt;I eventually got to top spot&lt;/a&gt;, which is obviously not that impressive when the winners had already revealed what they did - but I had learnt things and been able to implement them. I then contacted Dave to gloat - and he immediately overtook me.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;We have since worked on a couple of other competitions together, and won the &lt;a href="http://www.tiberius.biz/accuracy.html"&gt;NCDM Analytic challenge in 2008.&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Competitions like the HHP are a great way to learn and to also meet others with similar interests. 
I'm sure many collaborations and friendships will be made as a result, to the benefit of everyone.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &amp;nbsp; &amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-4386150124485533316?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/4386150124485533316/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/bath-mawr-coch.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/4386150124485533316'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/4386150124485533316'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/bath-mawr-coch.html' title='Bath Mawr Coch'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-3456667063614407931</id><published>2011-08-02T04:50:00.000-07:00</published><updated>2011-08-02T05:15:57.906-07:00</updated><title type='text'>Gone Shopping</title><content type='html'>&lt;div dir="ltr" 
style="text-align: left;" trbidi="on"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;Another New Comp&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Kaggle have just posted a new&amp;nbsp;competition&lt;a href="http://www.kaggle.com/c/dunnhumbychallenge"&gt;&amp;nbsp;Dunnhumby's Shopper Challenge&lt;/a&gt; that is a very interesting one. The data is historical records of when customers&amp;nbsp;visited&amp;nbsp;a store and how much they spent. The goal is to predict when customers will next visit the store and how much they will spend.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;This is probably a novel real world data set and I expect the interest in this&amp;nbsp;competition&amp;nbsp;to be high. The reasons I think this are&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The data is simple&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;There is not a massive amount of data, so processing power will not be an issue&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;It is a novel problem where creativity is needed and a new algorithm will probably have to be developed. 
It is not going to be a case of who can pre-process the data best and build the best ensemble using existing algorithms&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;So, where to start?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The first thing I always do is run the data through my Nifty Tool &lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html"&gt;(see here for example&lt;/a&gt;) to check it and generate some SQL so I can load it into a database.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Here is what comes out...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;CREATE&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;DATABASE&lt;/span&gt; dunnhumby&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;USE&lt;/span&gt; dunnhumby&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;CREATE&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;TABLE&lt;/span&gt; training&lt;br /&gt;(&lt;br /&gt;    customer_id    &lt;span 
style="color: #7f0055; font-weight: bold;"&gt;int&lt;/span&gt;&lt;br /&gt;,    visit_date    &lt;span style="color: #7f0055; font-weight: bold;"&gt;date&lt;/span&gt;&lt;br /&gt;,    visit_spend    &lt;span style="color: #7f0055; font-weight: bold;"&gt;float&lt;/span&gt;&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;BULK&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;INSERT&lt;/span&gt; training&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;FROM&lt;/span&gt; &lt;span style="color: #2a00ff;"&gt;'E:\comps\dunnhumby\training.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;WITH&lt;/span&gt;&lt;br /&gt;(&lt;br /&gt;MAXERRORS = 0,&lt;br /&gt;FIRSTROW = 2,&lt;br /&gt;FIELDTERMINATOR = &lt;span style="color: #2a00ff;"&gt;','&lt;/span&gt;,&lt;br /&gt;ROWTERMINATOR = &lt;span style="color: #2a00ff;"&gt;'\n'&lt;/span&gt;&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;--(12146637 row(s) affected)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;CREATE&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;TABLE&lt;/span&gt; test&lt;br /&gt;(&lt;br /&gt;    customer_id    &lt;span style="color: #7f0055; font-weight: bold;"&gt;int&lt;/span&gt;&lt;br /&gt;,    visit_date    &lt;span style="color: #7f0055; font-weight: bold;"&gt;date&lt;/span&gt;&lt;br /&gt;,    visit_spend    &lt;span style="color: #7f0055; font-weight: bold;"&gt;float&lt;/span&gt;&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;BULK&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;INSERT&lt;/span&gt; test&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;FROM&lt;/span&gt; &lt;span style="color: #2a00ff;"&gt;'E:\comps\dunnhumby\test.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; 
font-weight: bold;"&gt;WITH&lt;/span&gt;&lt;br /&gt;(&lt;br /&gt;MAXERRORS = 0,&lt;br /&gt;FIRSTROW = 2,&lt;br /&gt;FIELDTERMINATOR = &lt;span style="color: #2a00ff;"&gt;','&lt;/span&gt;,&lt;br /&gt;ROWTERMINATOR = &lt;span style="color: #2a00ff;"&gt;'\n'&lt;/span&gt;&lt;br /&gt;)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The data set is relatively nice, just 3 columns:&lt;/span&gt;&lt;/div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;customer ID&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;date&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;amount spent&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The dates run from 1st April 2010 to 18th July 2011.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;A Naive Guess&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I always like to get in a submission without actually really looking at the data, just to make sure I am reading the data correctly and that the submission file is in the correct 
format.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;In forecasting, a naive prediction is not going to be far wrong in the long run. If you want to forecast the weather, then saying tomorrow is going to be the same as today is going to serve you well in the long run if you have no other information to go off. So for this problem, we will say the next spend will be the same as the last spend, and the gap in days will also be the same.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;To achieve this in SQL is a bit tricky, but can be done. The code below does the trick and when submitted gets 9.5% correct on the leaderboard, the same as the simple baseline benchmark.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;/*************************************************&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;SQL to generate a naive prediction&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;- spend will be the same as the last visit&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;- the next visit will be in the same 
number&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;&amp;nbsp;&amp;nbsp;of days as the gap between the previous visit&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;**************************************************/&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- append a visit number to the data - visit 1 = most recent visit&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; &lt;span style="background-attachment: initial; background-clip: initial; background-color: #dd0000; background-image: initial; background-origin: initial; color: white; font-style: italic; font-weight: bold;"&gt;*&lt;/span&gt;, Rank() over (Partition &lt;span style="color: #7f0055; font-weight: bold;"&gt;BY&lt;/span&gt; customer_id &lt;span style="color: #7f0055; font-weight: bold;"&gt;order&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;by&lt;/span&gt; visit_date &lt;span style="color: #7f0055; font-weight: bold;"&gt;desc&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; visit_number&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;into&lt;/span&gt; &lt;span style="color: #3f7f59;"&gt;#temp1&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; test&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- create field which is days since previous visit&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; a.*&lt;br /&gt;,b.visit_date &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; previous_visit_date&lt;br /&gt;,DATEDIFF(DD,b.visit_date,a.visit_date) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; days_since_previous_visit&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;into&lt;/span&gt; &lt;span style="color: #3f7f59;"&gt;#temp2&lt;/span&gt;&lt;br /&gt;&lt;span 
style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; &lt;span style="color: #3f7f59;"&gt;#temp1 a inner join #temp1 b&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;on&lt;/span&gt; a.customer_id = b.customer_id&lt;br /&gt;and a.visit_number = b.visit_number - 1&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;where&lt;/span&gt; a.visit_number = 1&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- generate the submission file, making sure 1st April 2011 is the earliest date&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt;&lt;/pre&gt;&lt;pre style="background-attachment: initial; background-clip: initial; background-color: white; background-image: initial; background-origin: initial; background-position: initial initial; background-repeat: initial initial;"&gt;customer_id&lt;br /&gt;,&lt;br /&gt;    (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;br /&gt;        &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; dateadd(dd,days_since_previous_visit,visit_date) &amp;lt; &lt;span style="color: #2a00ff;"&gt;'2011-04-01'&lt;/span&gt; &lt;br /&gt;        &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; &lt;span style="color: #2a00ff;"&gt;'2011-04-01'&lt;/span&gt; &lt;br /&gt;        &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; dateadd(dd,days_since_previous_visit,visit_date)&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; visit_date&lt;br /&gt;,visit_spend&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; &lt;span style="color: #3f7f59;"&gt;#temp2&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;order&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: 
bold;"&gt;by&lt;/span&gt; customer_id&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Now Let's Look at the Data&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;In order to get a feel for things, we will aggregate the data to a daily level and look at how the population as a whole goes shopping. We will create a count of the total number of visits per day and generate some time based fields that will help us understand the daily and seasonal patterns in shopping. The SQL below will generate that data, which is small enough to save in Excel or as a text file and then load into your favourite analysis package. Alternatively, we could just create a table in the database and load directly from there (don't leave it as a view, as it takes a while to generate).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;-- add a field which is the days since 1st April (1st April 2010)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;alter&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;table&lt;/span&gt; training &lt;span style="color: #7f0055; font-weight: bold;"&gt;add&lt;/span&gt; daysSinceStart &lt;span style="color: #7f0055; font-weight: bold;"&gt;int&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;update&lt;/span&gt; training &lt;span style="color: #7f0055; font-weight: bold;"&gt;set&lt;/span&gt; daysSinceStart = &lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEDIFF&lt;/span&gt;(dd,&lt;span style="color: 
#2a00ff;"&gt;'2010-04-01'&lt;/span&gt;,visit_date)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- create a daily summary dataset&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; visit_date&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- fields of interest&lt;/span&gt;&lt;br /&gt;, &lt;span style="color: #7f0055; font-weight: bold;"&gt;COUNT&lt;/span&gt;(*) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; visits&lt;br /&gt;, &lt;span style="color: #7f0055; font-weight: bold;"&gt;avg&lt;/span&gt;(visit_spend) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; avg_visit_spend&lt;br /&gt;, stdev(visit_spend) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; stdv_visit_spend&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- time based predictor variables&lt;/span&gt;&lt;br /&gt;, &lt;span style="color: #7f0055; font-weight: bold;"&gt;min&lt;/span&gt;(daysSinceStart) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; daysSinceStart&lt;br /&gt;, &lt;span style="color: #7f0055; font-weight: bold;"&gt;sin&lt;/span&gt;(2 * pi() * (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(DAYOFYEAR, visit_date) * 1.0 / 365.0)) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; doySin &lt;br /&gt;, &lt;span style="color: #7f0055; font-weight: bold;"&gt;cos&lt;/span&gt;(2 * pi() * (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(DAYOFYEAR, visit_date) * 1.0 / 365.0)) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; doyCos&lt;br /&gt;, (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 1 &lt;span style="color: #7f0055; 
font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowSun&lt;br /&gt;, (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 2 &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowMon&lt;br /&gt;, (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 3 &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowTue&lt;br /&gt;, (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 4 &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowWed&lt;br /&gt;, (&lt;span 
style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 5 &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowThu&lt;br /&gt;, (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 6 &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowFri&lt;br /&gt;, (&lt;span style="color: #7f0055; font-weight: bold;"&gt;case&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;when&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;DATEPART&lt;/span&gt;(WEEKDAY, visit_date)) = 7 &lt;span style="color: #7f0055; font-weight: bold;"&gt;then&lt;/span&gt; 1 &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; 0 &lt;span style="color: #7f0055; font-weight: bold;"&gt;end&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;as&lt;/span&gt; dowSat&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; dbo.training&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;group&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;by&lt;/span&gt; visit_date&lt;br /&gt;&lt;span style="color: #7f0055; 
font-weight: bold;"&gt;order&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;by&lt;/span&gt; visit_date &lt;span style="color: #7f0055; font-weight: bold;"&gt;asc&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://www.tiberius.biz/"&gt;Tiberius&lt;/a&gt; is an ideal tool for analysing this type of data, as I initially wrote it to look at a similar type of data set - electricity consumption - and the patterns I want to discover will be similar.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I have built a model to predict visits as a function of time. The model predictors are weekly, seasonal and trend components, so we are trying to model the number of visits based solely on time. The plot below shows the errors in the model; the points are in date order.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;(Click on the plots to enlarge)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-BMtQ55CFQhI/TjfdDxRn-aI/AAAAAAAAAE8/O6QCr9VFvyQ/s1600/model_errors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="173" src="http://1.bp.blogspot.com/-BMtQ55CFQhI/TjfdDxRn-aI/AAAAAAAAAE8/O6QCr9VFvyQ/s320/model_errors.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;What we see is that there are certain dates where the errors stand out. 
These are easily identified in Tiberius by rolling your mouse over the image, which displays the dates.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The two big anomalies are 25th Dec 2010 and 1st Jan 2011 - Christmas Day and New Year's Day. So if we are trying to deduce where this data set is from, China is probably crossed off the list.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Other dates with large errors are:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;4th April 2010&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;3rd May 2010&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;31st May 2010&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;30th August 2010&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://www.woodlands-junior.kent.sch.uk/customs/questions/bankholidays.html"&gt;A bit of research&lt;/a&gt; will identify that these are all England &amp;amp; Wales public holidays, so I think we have figured out where the data set originates (hello Tesco!).&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" 
style="font-family: Arial, Helvetica, sans-serif;"&gt;The model of visits can be decomposed into weekly, annual and trend components.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The weekly component shows Sunday is the quietest day, and Friday the busiest day. Monday, Tuesday and Wednesday are all similar.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-jWd4M600MeQ/Tjfhl-If1iI/AAAAAAAAAFA/lgFMVdR_i4A/s1600/weekly_component.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="173" src="http://2.bp.blogspot.com/-jWd4M600MeQ/Tjfhl-If1iI/AAAAAAAAAFA/lgFMVdR_i4A/s320/weekly_component.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;The seasonal component shows early April is the busiest time of year.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-Kmw3mcsn_SA/Tjfhv89P9iI/AAAAAAAAAFE/FVDuK2InfA4/s1600/seasonal_component.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" 
src="http://4.bp.blogspot.com/-Kmw3mcsn_SA/Tjfhv89P9iI/AAAAAAAAAFE/FVDuK2InfA4/s320/seasonal_component.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;And unfortunately the trend looks to be going downhill.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-NJp_lR5o7oM/Tjfh1-GKVrI/AAAAAAAAAFI/xhPXlMi-MNo/s1600/trend_component.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="172" src="http://2.bp.blogspot.com/-NJp_lR5o7oM/Tjfh1-GKVrI/AAAAAAAAAFI/xhPXlMi-MNo/s320/trend_component.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Finally here is an interesting plot of the standard deviation of the spend,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Imv0DE2lcYQ/TjfqbRB7jYI/AAAAAAAAAFM/rLp4Mf2K20c/s1600/spend_standard_deviation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="170" src="http://1.bp.blogspot.com/-Imv0DE2lcYQ/TjfqbRB7jYI/AAAAAAAAAFM/rLp4Mf2K20c/s320/spend_standard_deviation.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;So we are on our way to understanding this problem.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span 
class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Good luck everyone, hope the challenge keeps you busy!&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-3456667063614407931?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/3456667063614407931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/gone-shopping.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3456667063614407931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3456667063614407931'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/08/gone-shopping.html' title='Gone Shopping'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-BMtQ55CFQhI/TjfdDxRn-aI/AAAAAAAAAE8/O6QCr9VFvyQ/s72-c/model_errors.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-9162953577220522955</id><published>2011-07-30T17:24:00.000-07:00</published><updated>2011-07-30T17:39:03.192-07:00</updated><title type='text'>No Change at the Top</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"&gt;I just spotted something on the HHP leaderboard that I have not seen before. There has been no change in the top 12 positions for 1 week.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"&gt;Does this mean we are all running out of ideas? Teams are submitting but the incremental improvement seems to be relatively&amp;nbsp;minuscule.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"&gt;Dave (&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html"&gt;He is smarter than me!&lt;/a&gt;) &amp;nbsp;and Willem seem to have a big gap at the top, and Edward is close on my tail for 3rd place. 
The team I would look out for though are some Old Dogs With New Tricks.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-o2XpPDMNbBg/TjSecvoLtJI/AAAAAAAAAE0/FdtSoSsfMXQ/s1600/leaderboardnochange1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="275" src="http://1.bp.blogspot.com/-o2XpPDMNbBg/TjSecvoLtJI/AAAAAAAAAE0/FdtSoSsfMXQ/s320/leaderboardnochange1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-FR9euBx_5nA/TjSkGBt-uYI/AAAAAAAAAE4/PoubW54nOtE/s1600/top50.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="308" src="http://4.bp.blogspot.com/-FR9euBx_5nA/TjSkGBt-uYI/AAAAAAAAAE4/PoubW54nOtE/s320/top50.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-9162953577220522955?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/9162953577220522955/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/no-change-at-top.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/9162953577220522955'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/9162953577220522955'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/no-change-at-top.html' title='No Change at the Top'/><author><name>Sali 
Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-o2XpPDMNbBg/TjSecvoLtJI/AAAAAAAAAE0/FdtSoSsfMXQ/s72-c/leaderboardnochange1.PNG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-3580598476147107374</id><published>2011-07-29T01:52:00.000-07:00</published><updated>2011-07-29T01:52:34.669-07:00</updated><title type='text'>A1 Melbourne and MelbURN: R vs SAS</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Last night was another double header in the Melbourne analytic scene. Someone else has already described the events I attended, so pointing to their post will save me some typing. 
&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;a href="http://analystfirst.com/2011/07/28/574/a1-melbourne-and-melburn-r-vs-sas/#.TjJ0D_pHtaA.blogger"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;A1 Melbourne and MelbURN: R vs SAS&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-3580598476147107374?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/3580598476147107374/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/a1-melbourne-and-melburn-r-vs-sas.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3580598476147107374'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3580598476147107374'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/a1-melbourne-and-melburn-r-vs-sas.html' title='A1 Melbourne and MelbURN: R vs SAS'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8357156436787141833</id><published>2011-07-20T17:01:00.000-07:00</published><updated>2011-07-20T17:06:19.711-07:00</updated><title type='text'>Damn Computers II</title><content type='html'>&lt;div 
dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;In an earlier post I grumbled a bit about how computers can be very frustrating...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html"&gt;http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;here is another one for the collection!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;Chrome v Internet Explorer 9&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;As described in the previous post I have just upgraded my PC. Now generally my experience with Windows is that when new, computers are quite quick but with time they seem to slow down and will eventually take 30 mins just to turn on. There is a school of thought that this is an inbuilt Microsoft conspiracy to ensure you keep upgrading your operating system when they release a new one - which seems to be every second week at the moment. 
The other theory is that in time you install more software that wants to do things for you without asking (why does Adobe think I need it to check for updates every 5 mins?).&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Not wanting to clutter my new machine, I am resisting installing anything that is not essential at the moment. I declined the 'trial' version of the latest Microsoft Office suite during the install of the o/s. This caused a problem when I quickly wanted to jot some numbers down in a spreadsheet. Rather than install my old licenced version of Office 2000, I downloaded the free &lt;a href="http://www.openoffice.org/"&gt;Open Office&lt;/a&gt; and was very impressed - seamless integration with Microsoft Office, so there is no worry that you won't be able to read files others send you, and it basically looks the same - I think I will be sticking with that from now on.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I also resisted downloading the Firefox browser and stuck with IE, until now that is.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I wanted to reply to a blog post. Having spent a while composing the reply, I tried to submit it and got stuck in an endless loop of being prompted to log into my Google account. Having experienced things like this before, my first option these days is to try a different browser. 
I went for Chrome this time, on the hunch that since Chrome is a Google offering, as is Blogger, it should work. And it did - first time.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Having developed a few web sites, I find these differences between browsers a bit of a pain. You really have to test everything out for all scenarios these days. On the Kaggle leaderboard I'm getting different views depending on what browser I am using. In IE my team name is just in bold; in Firefox and Chrome the whole cell is red, making it easier to spot me, which I think is the look that was intended.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Internet Explorer - no colouring of the cells&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-ZzrpCmLvOdM/Tidqh5Z_t0I/AAAAAAAAAEs/rhpKEk_vD3s/s1600/teamnameIE.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="95" src="http://2.bp.blogspot.com/-ZzrpCmLvOdM/Tidqh5Z_t0I/AAAAAAAAAEs/rhpKEk_vD3s/s320/teamnameIE.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Chrome &amp;amp; Firefox - I think this is the intended look&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;div 
class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-_9JWR9kEI7Y/TidqovYt1PI/AAAAAAAAAEw/ab0qhTpkJ8E/s1600/teamnameChrome.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="96" src="http://3.bp.blogspot.com/-_9JWR9kEI7Y/TidqovYt1PI/AAAAAAAAAEw/ab0qhTpkJ8E/s320/teamnameChrome.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;Keyboard Issues&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;With this new PC, I am having trouble typing a single quote ( ' ), which is required quite a lot in programming. If I press the key once, nothing happens. If I then press it again, two of them appear and I have to delete one to get the single quote required.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;I'm sure there is a setting somewhere to prevent this as I've never had this issue before. 
Rather than me waste all day trying to figure it out, I'm hoping someone can point me in the right direction.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;Thanks in advance....&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8357156436787141833?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8357156436787141833/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/damn-computers-ii.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8357156436787141833'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8357156436787141833'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/damn-computers-ii.html' title='Damn Computers II'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-ZzrpCmLvOdM/Tidqh5Z_t0I/AAAAAAAAAEs/rhpKEk_vD3s/s72-c/teamnameIE.PNG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-6180138904485474543</id><published>2011-07-20T05:24:00.000-07:00</published><updated>2011-07-21T14:12:12.273-07:00</updated><title type='text'>Pump Up The Volume</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The HHP competition convinced me that it was time to upgrade my tired old computer. As data sets get bigger, having more RAM and a 64 bit operating system can make things a little quicker.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;For those who aren't aware, if you have a 32-bit operating system then there is no point having&amp;nbsp;more than 4GB of RAM, as&amp;nbsp;it can't be used. So if you&amp;nbsp;need more RAM, then you will also need a 64-bit operating system.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I had a chat with the guy in the computer shop and he convinced me that it was going to be worth my while just to buy a new PC rather than upgrade my old one. My specifications were 16GB RAM, Windows 7 64-bit and&amp;nbsp;the latest i7 processors. Size was also a consideration - due to the impending take over of my office by my daughter and me being relegated to the cupboard under the stairs.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I considered a laptop, but apparently the i7 processors that go in them are not as fast as the desktop versions. Instead I opted for a really small case, which increased the price as the case was more expensive and apparently you need a smaller motherboard to fit in them. 
A bog-standard graphics card&amp;nbsp;was added and Bob's your&amp;nbsp;uncle.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The next day the shop rang me&amp;nbsp;- with some bad news.&amp;nbsp;The&amp;nbsp;power supply that is required for the i7 processors won't fit into the nice little case I had opted for, so they recommended another for me that is kind of normal size (but still half the size and weight of my incumbent that comes with 5 fans and neon lights!).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The downgrade in case saved me some money, and the total cost was&amp;nbsp;AU$1,300 - no monitor, keyboard or mouse included though. Incidentally, it would still have cost about $800 to upgrade my old PC (or so the salesman told me), and I now have 2 (actually about&amp;nbsp;8&amp;nbsp;as I&amp;nbsp;never seem to get around to getting rid of my old ones).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Keen to see what it could do, I ran my old algorithms through the new beast, and it seemed to accomplish the same tasks as my Win 7 64-bit i5 laptop in&amp;nbsp;&amp;nbsp;66% of the time. 
I then looked at the task manager to see what was going on under the hood.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The first thing I noticed is that with no applications running at all, just the operating system, it seems to be using 1.71GB, which seems a little excessive.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Z66nOrBuBks/TiasJK52QSI/AAAAAAAAAEg/uPHULp17fGA/s1600/taskmanagerA.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-Z66nOrBuBks/TiasJK52QSI/AAAAAAAAAEg/uPHULp17fGA/s1600/taskmanagerA.PNG" /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;I then kicked off my algorithms and was a bit dismayed to see that&amp;nbsp;only 14% of the available CPU was being used - so essentially only 1 of the 8 cores.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-zsqjd2UbTVw/TiasTuGfIYI/AAAAAAAAAEk/sh7H2BYjUqk/s1600/taskmanager.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="171" src="http://1.bp.blogspot.com/-zsqjd2UbTVw/TiasTuGfIYI/AAAAAAAAAEk/sh7H2BYjUqk/s320/taskmanager.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;This is because the code I was running was doing things serially. It was doing a set of tasks that needed the prior task to be completed before the next one could begin. 
The extra processors would enable me to work on different tasks at the same time (such as browsing the Internet) - but the whole point of the investment was to speed up the number crunching.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;One solution to utilise the extra processing power would have been to run several algorithms at the same time - but I wanted one algorithm to complete 8 times quicker.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;A particularly popular algorithm is &lt;strong&gt;Random Forests&lt;/strong&gt;. Now this is essentially an ensemble of lots&amp;nbsp;of individual decision trees that are all built independently. There is no need to wait for one tree to be&amp;nbsp;finished before we can start building the next - an ideal candidate for parallel processing.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;span style="background-color: white;"&gt;It took me a while to hack together a solution, but eventually I got 100% CPU usage and my model built in 1/8th of the time&lt;span style="background-color: white;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-l8zmI74sq3w/Tiasa_MEsYI/AAAAAAAAAEo/NAwGl7WMMco/s1600/taskmanager1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="171" src="http://4.bp.blogspot.com/-l8zmI74sq3w/Tiasa_MEsYI/AAAAAAAAAEo/NAwGl7WMMco/s320/taskmanager1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Below is the R code that will multithread the building of Random Forests. 
The multithreading&amp;nbsp;piece came from &lt;a href="http://www.kaggle.com/c/overfitting/forums/t/443/how-to-parallelise-r-code-included"&gt;a generous Kaggle post in the Don't Overfit competition&lt;/a&gt;&amp;nbsp;by Chris Pardy, and with that bit of info, it was trivial to adapt for my purposes.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Incidentally, Chris posted this code completely unexpectedly in response to some other code I had already shared. This was a nice surprise and is why the Kaggle concept is such a great learning environment.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;There is now a dilemma with sharing, though - a slight matter of $3million.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;From what I understand, in order to win one of the milestone cash prizes you essentially have to tell everyone what you did in a way that they can replicate your results. If you do this, you are essentially putting yourself at the bottom of the leaderboard, as everyone else has your solution plus their own.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The milestone prizes increase in value every 6 months, so would it be a good strategy for the leader to not take the&amp;nbsp;$30,000 in the first one and hold out for $50,000&amp;nbsp;6 months later or $60,000 12 months later, given that you will potentially be improving your model by learning from those who do take the money?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Any thoughts?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;strong&gt;R code for parallel random forests&lt;/strong&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border: 1px solid black; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: rgb(255, 255, 255); color: black;"&gt;&lt;span 
style="color: #3f7f59;"&gt;######################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# some example code to multithread&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# the building of random forests&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;######################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#load the required libraries&lt;/span&gt;&lt;br /&gt;library(randomForest)&lt;br /&gt;library(snowfall)    &lt;br /&gt;library(rlecuyer)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#load the data&lt;/span&gt;&lt;br /&gt;theData &amp;lt;- iris&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#set the formula&lt;/span&gt;&lt;br /&gt;theFormula &amp;lt;- as.formula(&lt;span style="color: #2a00ff;"&gt;"Species ~ . "&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#specify number of threads&lt;/span&gt;&lt;br /&gt;threads &amp;lt;- 8&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#total number of trees in the random forest&lt;/span&gt;&lt;br /&gt;totalTrees &amp;lt;- 1000&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#number of trees to build per thread (rounds down if totalTrees is not a multiple of threads)&lt;/span&gt;&lt;br /&gt;trees &amp;lt;- as.&lt;span style="color: #7f0055; font-weight: bold;"&gt;integer&lt;/span&gt;(totalTrees / threads)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# now the funky stuff&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#the function each thread calls (the index i is not used)&lt;/span&gt;&lt;br /&gt;parallelRF &amp;lt;- function(i){&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;return&lt;/span&gt;(randomForest(theFormula, data=theData, ntree=trees))&lt;br 
/&gt;}    &lt;br /&gt; &lt;br /&gt; &lt;span style="color: #3f7f59;"&gt;# Initialise "cluster"&lt;/span&gt;&lt;br /&gt; sfInit(parallel = TRUE, cpus = threads, type = &lt;span style="color: #2a00ff;"&gt;"SOCK"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt; &lt;span style="color: #3f7f59;"&gt;# Make data available to other R instances / nodes&lt;/span&gt;&lt;br /&gt; sfExport(list = c(&lt;span style="color: #2a00ff;"&gt;"theData"&lt;/span&gt;, &lt;span style="color: #2a00ff;"&gt;"theFormula"&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;"trees"&lt;/span&gt;))&lt;br /&gt;&lt;br /&gt; &lt;span style="color: #3f7f59;"&gt;# To load a library on each R instance / node&lt;/span&gt;&lt;br /&gt; sfClusterEval(library(randomForest))&lt;br /&gt; &lt;br /&gt; &lt;span style="color: #3f7f59;"&gt;# Use a parallel RNG to avoid correlated random numbers&lt;/span&gt;&lt;br /&gt; &lt;span style="color: #3f7f59;"&gt;# Requires library(rlecuyer) installed on all nodes&lt;/span&gt;&lt;br /&gt; sfClusterSetupRNG()&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# build the random forests&lt;/span&gt;&lt;br /&gt; allForests &amp;lt;- sfClusterApplyLB(1:threads, parallelRF)&lt;br /&gt;&lt;br /&gt; sfStop()&lt;br /&gt; &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#everything finished, so&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#merge all the forests into one&lt;/span&gt;&lt;br /&gt; myRandomForest &amp;lt;- allForests[[1]]&lt;br /&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;for&lt;/span&gt;(i in 2:threads){&lt;br /&gt; myRandomForest &amp;lt;- combine(myRandomForest,allForests[[i]])&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#convince ourselves we have them all combined&lt;/span&gt;&lt;br /&gt;myRandomForest$ntree&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#what is important&lt;/span&gt;&lt;br /&gt;myRandomForest$importance&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div 
class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-6180138904485474543?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/6180138904485474543/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/pump-up-volume.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6180138904485474543'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6180138904485474543'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/pump-up-volume.html' title='Pump Up The Volume'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Z66nOrBuBks/TiasJK52QSI/AAAAAAAAAEg/uPHULp17fGA/s72-c/taskmanagerA.PNG' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-5843200201994253949</id><published>2011-07-14T14:28:00.000-07:00</published><updated>2011-07-17T17:05:00.101-07:00</updated><title type='text'>Small World in Melbourne</title><content type='html'>&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Having two young kids and no family in Melbourne (Australia)&amp;nbsp;to babysit, my evening social activities are very minimal at the moment. 
They normally revolve around analytic type user groups that start after work and last for a couple of hours - then home for bath time.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Last night was a double header.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;b&gt;The Melbourne Tableau User Group&lt;/b&gt; had a meeting to demonstrate the new features of &lt;a href="http://www.tableausoftware.com/"&gt;Tableau&lt;/a&gt;&amp;nbsp;that are due to be released later this month. Tableau is a package that will draw you graphs and plot geographic data on maps. It is one of my favourite pieces of software for quickly drilling down&amp;nbsp;on issues&amp;nbsp;- you get to what you want very quickly. It costs ~$1,000 for a single licence&amp;nbsp;or $2,000&amp;nbsp;for an advanced licence that lets you connect to ODBC data sources.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;They gave away an iPad at the presentation - as one of the new features is dashboards for mobile devices. With only 27 people there I was feeling lucky, but #5 won (who incidentally already had one and was using it in the meeting to take notes); I was #9.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I spoke to the new Melbourne based Tableau employee, and asked him what he did, as I thought the biggest sales people for Tableau are&amp;nbsp;the users themselves. He agreed - it is a great product and I've not heard a bad word about it.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The meeting concluded with free beer and finger food, which was nice. 
Here&lt;/span&gt;&lt;span style="font-family: Arial;"&gt;&amp;nbsp;I met&amp;nbsp;one of&amp;nbsp;only two previous Australian KDD Cup winners,&amp;nbsp;&lt;a href="http://www.sigkdd.org/kddcup/index.php?section=2002&amp;amp;method=result"&gt;Bhavani Raskutti&lt;/a&gt;, who I know very well (the other was the &lt;a href="http://www.kddcup-orange.com/winners.php"&gt;University of Melbourne&lt;/a&gt;).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;As soon as this event drew to a close at 6:30, it was 100m down the road to a &lt;a href="http://groups.google.com/group/datamelb?pli=1"&gt;&lt;b&gt;Data Melbourne&lt;/b&gt;&lt;/a&gt; talk. This one was even more social as a beer was thrust in my hand before the talk even began.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The reason I attended this event&amp;nbsp;(permission for a leave pass had to be booked in advance) was that the speaker was none other than Matt Kwan from the HHP team Planet Melbourne, who are currently 6th on the leaderboard.&amp;nbsp;Matt talked about his PhD research on using mobile phone data to analyse daytime population movements, which was really quite interesting. He then talked about the HHP, just basically introducing the concept of the competition.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;He didn't give away any secrets, but some interesting takeaways were that he had never heard of Kaggle before and was drawn to this by the lure of big money. He is an accomplished programmer but had done no analytics, had to learn linear algebra from scratch, is coding all his own software, and wishes he had a PC with more grunt. 
All in all he is doing very well.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;One interesting comment was whether it would be more profitable to put all the effort into creating your own startup rather than spending all these hours for a potential but not very likely $500k. It is worth noting that several companies started up as a result of the Netflix Prize, including &lt;a href="http://www.commendo.at/"&gt;Commendo&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Melbourne is a small place. It is also the spiritual home of Kaggle. In fact I saw &lt;a href="http://ausdm09.togaware.com/schedule.pdf"&gt;Anthony Goldbloom's first ever presentation of the Kaggle concept&lt;/a&gt; at AusDM 2009, well before it was actually launched (incidentally&amp;nbsp; I was running a &lt;a href="http://www.tiberius.biz/ausdm09/index.html"&gt;data mining&amp;nbsp;comp&lt;/a&gt; for this conference that was about blending solutions from the Netflix prize to get the best mega ensemble - may be handy for the HHP?).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;At an R user group about 12 months ago, Anthony and I were sat at the same table as one &lt;a href="http://www.kaggle.com/users/3256/jeremy-howard-kaggle"&gt;Jeremy Howard&lt;/a&gt;, who became intrigued with the Kaggle&amp;nbsp;concept and went on to become Kaggle's most successful competitor and now part of the Kaggle team.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;So, small world in Melbourne - my adopted home and a great place to live and work.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-5843200201994253949?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' 
type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/5843200201994253949/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/small-world-in-melbourne.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/5843200201994253949'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/5843200201994253949'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/small-world-in-melbourne.html' title='Small World in Melbourne'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-6636080900024215287</id><published>2011-07-08T15:25:00.000-07:00</published><updated>2011-07-08T15:29:31.954-07:00</updated><title type='text'>The Power of Twitter and Analyst First</title><content type='html'>&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;I don't use twitter, but it must be important as I have just bought a new Panasonic DVD player that can connect to the internet - but the only things you can seem to do are look at Twitter, YouTube, Picasso and the Weather. I also watched a presentation from a colleague recently and every 10 seconds these little messages kept popping up on his screen saying who had tweeted what. 
Not being a&amp;nbsp;Twitter user I had not seen this before, but it makes you realise that information exchange is instantaneous these days.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;I am saying this as I had another big spike in views of this blog, in a specific half hour. Being a data miner and generally curious chap I obviously wanted to know what was the cause of this, or more precisely, whether I had the skills to find out what the cause was.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;The plot below shows the spike... &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-maAkltJ6o9o/Thd0704pIwI/AAAAAAAAAEQ/6PvVG84iTLo/s1600/twitter1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="95" src="http://1.bp.blogspot.com/-maAkltJ6o9o/Thd0704pIwI/AAAAAAAAAEQ/6PvVG84iTLo/s320/twitter1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;and it occurred in a specific half hour exactly 19 hours ago... 
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-vuZts_j0vSg/Thd1MLagf6I/AAAAAAAAAEU/-4LLD3S1p_w/s1600/twitter2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="96" src="http://4.bp.blogspot.com/-vuZts_j0vSg/Thd1MLagf6I/AAAAAAAAAEU/-4LLD3S1p_w/s320/twitter2.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;and a look at the web stats for the day pointed me to the source...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-iHWDOwcvip8/Thd2AFnMOcI/AAAAAAAAAEY/S5M18QPgP-8/s1600/twitter4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="132" src="http://2.bp.blogspot.com/-iHWDOwcvip8/Thd2AFnMOcI/AAAAAAAAAEY/S5M18QPgP-8/s320/twitter4.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;and then a search of Twitter pointed me to the guilty party...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-2LnOPEpQf4o/Thd2aiCNErI/AAAAAAAAAEc/KUEjea4rM54/s1600/twitter3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="147" src="http://2.bp.blogspot.com/-2LnOPEpQf4o/Thd2aiCNErI/AAAAAAAAAEc/KUEjea4rM54/s320/twitter3.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span 
style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Note this was tweeted exactly 19 hours ago, the exact time the spike occurred.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Now this is all very interesting you may say, but so what?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;My point is that you need a curious mind to be a good data scientist - not expensive software.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;In business today, those who make the decisions think that buying the most expensive analytic software is going to solve all the problems known to man, and you can pluck any monkey out to use it. To me this is putting the cart before the horse. If you want the best results, this could not be further from the truth. You need to spend money on the best analysts first and then think about the software.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;I am currently 2nd in the HHP leaderboard, and have not used&amp;nbsp; any software that was not freely available or written by myself to generate my predictions (except for some commercial compilers bought years ago and maybe Windows if you count that).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;There is a new movement called Analyst First that has started up here in Australia. 
If you are an analytic manager I recommend watching the video below.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;a href="http://analystfirst.com/2011/06/15/87/analyst-first-101/#.Thd4NyTXG5M.blogger"&gt;Analyst First 101&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-6636080900024215287?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/6636080900024215287/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/power-of-twitter-and-analyst-first.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6636080900024215287'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6636080900024215287'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/07/power-of-twitter-and-analyst-first.html' title='The Power of Twitter and Analyst First'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-maAkltJ6o9o/Thd0704pIwI/AAAAAAAAAEQ/6PvVG84iTLo/s72-c/twitter1.PNG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-6558750249568251556</id><published>2011-06-30T14:14:00.000-07:00</published><updated>2011-07-01T02:44:48.744-07:00</updated><title type='text'>If you build it, they will come</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Or will they?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;(Apparently, the actual quote is&amp;nbsp;"&lt;i&gt;If you build it, he will come" &lt;/i&gt;from &lt;/span&gt;&lt;a class="extiw" href="http://en.wikipedia.org/wiki/Field_of_Dreams" title="w:Field of Dreams"&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;i&gt;Field of Dreams&lt;/i&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;i&gt;.)&lt;/i&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;I figured if I am to make money out of this blog by writing about the HHP, then it would be a good advert if I was actually doing well in the competition. I have recently been putting effort in&amp;nbsp;on this front, and team Sali Mali&amp;nbsp;eventually&amp;nbsp;got to the top of the leaderboard.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Being a data geek, I looked at the subsequent stats of this blog and saw a huge spike for a particular half hour around the time I got to the top.&amp;nbsp; 'Bingo' I thought - that's the way to generate traffic.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;I then looked at my LinkedIn stats, as the only real&amp;nbsp;way to get to this blog is via my Kaggle profile, which will take you to my LinkedIn page and then to this blog. 
Surprisingly there was no such spike there&amp;nbsp; - so what was going on?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Blog Views:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-7PnsA5g4ung/Tgzb_HgJaRI/AAAAAAAAAEI/4BIH6jzilAo/s1600/blogstats1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="121" i$="true" src="http://3.bp.blogspot.com/-7PnsA5g4ung/Tgzb_HgJaRI/AAAAAAAAAEI/4BIH6jzilAo/s320/blogstats1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;LinkedIn views:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-JsoHZV4ZCnU/TgzgWLEpH0I/AAAAAAAAAEM/qHen9cPa-ak/s1600/linkedin.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="238" i$="true" src="http://4.bp.blogspot.com/-JsoHZV4ZCnU/TgzgWLEpH0I/AAAAAAAAAEM/qHen9cPa-ak/s320/linkedin.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I don't know exactly what time zones everything is in, but my blog stats pointed me to this post on the Kaggle blog, which&amp;nbsp;I am assuming is the cause of the spike.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.heritagehealthprize.com/c/hhp/forums/t/664/cross-validation-discrepancies/4381#post4381"&gt;http://www.heritagehealthprize.com/c/hhp/forums/t/664/cross-validation-discrepancies/4381#post4381&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Anyway, the data mining point of all this is that sometimes people are quick to jump to conclusions that are completely wrong. 
The real answers are always in the data - which is why I think the HHP will be won by a data scientist and prior expert medical knowledge will play no part at all.&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-6558750249568251556?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/6558750249568251556/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/if-you-build-it-they-will-come.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6558750249568251556'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6558750249568251556'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/if-you-build-it-they-will-come.html' title='If you build it, they will come'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-7PnsA5g4ung/Tgzb_HgJaRI/AAAAAAAAAEI/4BIH6jzilAo/s72-c/blogstats1.PNG' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8569238205892646505</id><published>2011-06-18T14:27:00.000-07:00</published><updated>2011-06-18T14:34:46.185-07:00</updated><title type='text'>The more you practice, the luckier you become</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" 
trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Chris R has now opened up a&amp;nbsp;gap at the top of the leaderboard. It may look like nothing, but when you plot it, it is quite significant.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-4VlfZ4D2dUs/Tf0Rd0sOEyI/AAAAAAAAAEA/7q5f_9o7Muo/s1600/leaderboard+19+June.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" i$="true" src="http://2.bp.blogspot.com/-4VlfZ4D2dUs/Tf0Rd0sOEyI/AAAAAAAAAEA/7q5f_9o7Muo/s320/leaderboard+19+June.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;If you plot the number of entries per team, we see that progress seems to come with more effort. Chris R and a few other teams are doing almost daily submissions.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-iX7yC6dObAE/Tf0QTUaAsPI/AAAAAAAAAD4/QOlHwdqjji8/s1600/leaderboard+error+v+submissions.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="221" i$="true" src="http://3.bp.blogspot.com/-iX7yC6dObAE/Tf0QTUaAsPI/AAAAAAAAAD4/QOlHwdqjji8/s320/leaderboard+error+v+submissions.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;I am now progressing steadily as I investigate more of the data and more algorithms. I have gone from 40th to 19th by just refining the algorithm, rather than adding&amp;nbsp;any&amp;nbsp;different data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The&amp;nbsp;technique I am using to determine if things are improving is quite simple. 
Rather than have an independent set to test the errors against, I randomly split the set in half, build a model on&amp;nbsp;one set&amp;nbsp;and then calculate the error on the rest. The error on the building set is the 'in bag' error and the error on the unseen set is the 'out of bag' (OOB) error. This random splitting is repeated many times with the predictions just averaged. What I am looking for is the OOB error&amp;nbsp;to go&amp;nbsp;down with improvements in the algorithm. This seems to work, and the leaderboard scores improve as my OOB scores improve, although the leaderboard error is about 0.01 higher than my OOB&amp;nbsp;error.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;For a sanity check, you can plot the distributions of the predictions by in bag, out of bag and leaderboard set. The plot below shows a remarkable similarity, meaning the data distributions in the sets are very similar.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-D2i5NdXEjT4/Tf0Qo1Xx3cI/AAAAAAAAAD8/uNR-V1sDFYE/s1600/SCORE_DISTRIBUTION.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="184" i$="true" src="http://3.bp.blogspot.com/-D2i5NdXEjT4/Tf0Qo1Xx3cI/AAAAAAAAAD8/uNR-V1sDFYE/s320/SCORE_DISTRIBUTION.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8569238205892646505?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8569238205892646505/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/more-you-practice-luckier-you-become.html#comment-form' title='1 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8569238205892646505'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8569238205892646505'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/more-you-practice-luckier-you-become.html' title='The more you practice, the luckier you become'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-4VlfZ4D2dUs/Tf0Rd0sOEyI/AAAAAAAAAEA/7q5f_9o7Muo/s72-c/leaderboard+19+June.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7077258126096556385</id><published>2011-06-08T14:38:00.001-07:00</published><updated>2011-06-08T16:10:53.003-07:00</updated><title type='text'>Data Anomalies</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;This is a place where I will put some quirks of the HHP data that may or may not&amp;nbsp;be relevant in winning the&amp;nbsp;$3 million.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;1) Be aware of the claims truncated&amp;nbsp;flag for&amp;nbsp;the claims&amp;nbsp;in Y1.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-SKEIHEzKim4/Te_r1IOdwKI/AAAAAAAAADw/KXRMO6bKNW4/s1600/claimstruncated.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="185" 
src="http://3.bp.blogspot.com/-SKEIHEzKim4/Te_r1IOdwKI/AAAAAAAAADw/KXRMO6bKNW4/s320/claimstruncated.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;2) Some members have claims for &lt;a href="http://en.wikipedia.org/wiki/Pediatrics"&gt;pediatrics&lt;/a&gt; but are clearly not children. This might indicate family accounts or an incorrect age.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-xOHbgJioYC4/TfAA39fGhVI/AAAAAAAAAD0/tyJN54KXxwQ/s1600/pediatrics.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-xOHbgJioYC4/TfAA39fGhVI/AAAAAAAAAD0/tyJN54KXxwQ/s320/pediatrics.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7077258126096556385?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7077258126096556385/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://anotherdataminingblog.blogspot.com/2011/06/data-anomalies.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7077258126096556385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7077258126096556385'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/data-anomalies.html' title='Data Anomalies'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-SKEIHEzKim4/Te_r1IOdwKI/AAAAAAAAADw/KXRMO6bKNW4/s72-c/claimstruncated.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-2495900062289005451</id><published>2011-06-05T18:50:00.000-07:00</published><updated>2011-06-07T03:30:53.662-07:00</updated><title type='text'>Scraping Up the Leaderboard</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;b&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Leaderboards in analytic competitions can be very revealing.&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;From a competitor viewpoint, you can see if you are in the right ballpark or completely missing something that others have obviously spotted.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;If you 
are the competitor who is in the lead, then revealing your hand can be very detrimental as it encourages others to try harder and you tend to take your own foot off the gas if you are ahead.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;I have been on both ends...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;The right end&amp;nbsp; a)&lt;/b&gt;&lt;/span&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The PAKDD09 Data Mining Competition was about predicting if credit card customers would pay their bills. Similar to the HHP, you had to predict into the future. &lt;/span&gt;&lt;a href="http://sede.neurotech.com.br/PAKDD2009/leaderboard.do?method=load"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;You can see the leaderboard here&lt;/span&gt;&lt;/a&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Now the team at the top of the training data leaderboard, 'Tiberius Data Mining Solutions', is my team, but we weren't always there.&amp;nbsp;&amp;nbsp;You will notice the team in 2nd place, Suncorp, is way ahead of the field and they had obviously spotted something no one else had. Curiosity to figure out exactly what it was eventually led us to discover new insights and overtake them. 
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Curiously, in this competition we ended up 4th, and the winners came from nowhere on the training data leaderboard to win on the out of time dataset.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;The right end&amp;nbsp; b)&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-LTrEW3TiPx8/TewqD-yfyUI/AAAAAAAAADo/09FMjdJoP6Q/s1600/tourism2leaderboard.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-LTrEW3TiPx8/TewqD-yfyUI/AAAAAAAAADo/09FMjdJoP6Q/s320/tourism2leaderboard.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The plot above shows the final standings for the Tourism2 Kaggle competition. On the training leaderboard, Kaggle's own Jeremy Howard (in his pre Kaggle days) was in front by a country mile after only a handful of submissions. This set a benchmark to aspire to, and pushed others to try to figure out what Jeremy had discovered. In the end I got to 2nd place on the leaderboard but was still way off Jeremy. 
Eventually though on the holdout sample, my submission did prove to be the winner.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;b&gt;The Wrong End&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-2IGGRBqKUDI/TewrzRDIldI/AAAAAAAAADs/kHOAP2KBWCA/s1600/InformsleaderboardR.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-2IGGRBqKUDI/TewrzRDIldI/AAAAAAAAADs/kHOAP2KBWCA/s320/InformsleaderboardR.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;The Kaggle Informs contest was about predicting stock prices. This was one where I was surprised to make a submission that shot to the top of the leaderboard by a wide margin and stayed there for a while until the last few hours when I was overtaken by 3 teams. 
I had taken my eye off the ball, but would like to think I had encouraged the eventual winners to squeeze every last bit of juice out of the data.&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;Leaderboard Plots&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://blog.kaggle.com/2011/04/24/the-deloittefide-chess-competition-play-by-play/"&gt;This blog post&lt;/a&gt; on the recent Chess comp is very interesting and studies how the leaderboard evolved through time and how big leaps in scores are made.&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;I ran a recent Kaggle competition called 'Don't Overfit' and &lt;a href="http://www.kaggle.com/c/overfitting/forums/t/325/leaderboard"&gt;posted some leaderboard plots&lt;/a&gt; to the forum during the course of the comp. The way I did this seemed quick the first time around - just copy and paste into Excel and plot a graph. Doing this as a one off got me to see what I wanted fast, but repeating it every week became tedious.&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;I then figured this task could be automated, and a bit of Googling soon provided me with the clues I needed to proceed. 
If I run the R code below, it will plot me the current leaderboard for any Kaggle comp and show me where I am (and how much more effort I need to put in).&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;What is interesting about the current scores is that maybe the top 5 teams are a bit ahead, but everyone else is on a nice trajectory, unlike some of the other leaderboards we have seen where there are marked step increases.&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;My own score is going slowly in the right direction as I explore more of the data and tighten up the techniques, but I am going backwards as others are doing this at a quicker pace!&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-0_zMlQoT_xQ/TewpWwXpBrI/AAAAAAAAADk/vUshBKNeTFg/s1600/HHPleaderboardR.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-0_zMlQoT_xQ/TewpWwXpBrI/AAAAAAAAADk/vUshBKNeTFg/s320/HHPleaderboardR.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The R code below will plot the live leaderboard positions as shown above&lt;/span&gt;...&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: 
#3f7f59;"&gt;##########################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# An R function to plot your position&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# on a Kaggle leaderboard&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;##########################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;plotKaggleLeaderboard &amp;lt;- function(theURL,myTeam,topX=100,title){&lt;br /&gt;    &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#this library does all the hard work&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#for Windows users, to install this package see&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#&lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#and &lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/&lt;/span&gt;&lt;br /&gt;library(XML)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#go and read the tables from the web page&lt;/span&gt;&lt;br /&gt;tables &amp;lt;- readHTMLTable(theURL)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#get the table of interest&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#names(tables)&lt;/span&gt;&lt;br /&gt;lb &amp;lt;- tables[[&lt;span style="color: #2a00ff;"&gt;'leaderboard-table'&lt;/span&gt;]]&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#see what the columns are&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#colnames(lb)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#nrow(lb)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#convert to numeric - see ?readHTMLTable&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#numeric_cols &amp;lt;- c('AUC','Entries')&lt;/span&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;#numeric_cols &amp;lt;- c('RMSLE','Entries')&lt;/span&gt;&lt;br /&gt;numeric_cols &amp;lt;- c(4,5)&lt;br /&gt;lb[numeric_cols] = lapply(lb[numeric_cols], function(&lt;span style="color: #7f0055; font-weight: bold;"&gt;x&lt;/span&gt;) as.numeric(gsub(&lt;span style="color: #2a00ff;"&gt;".* "&lt;/span&gt;, &lt;span style="color: #2a00ff;"&gt;""&lt;/span&gt;, as.character(&lt;span style="color: #7f0055; font-weight: bold;"&gt;x&lt;/span&gt;))))&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#extract the team name from a messy field&lt;/span&gt;&lt;br /&gt;team_col &amp;lt;- c(&lt;span style="color: #2a00ff;"&gt;'Team Name'&lt;/span&gt;)&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#lb[team_col]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#split the field on "\r" or "\n", which separates the values packed into the field, and keep the first piece &lt;/span&gt;&lt;br /&gt;lb[,team_col] &amp;lt;- sapply(strsplit(as.character(lb[,team_col]), &lt;span style="color: #2a00ff;"&gt;"[&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\r&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\n&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;]"&lt;/span&gt;), function (&lt;span style="color: #7f0055; font-weight: bold;"&gt;x&lt;/span&gt;) &lt;span style="color: #7f0055; font-weight: bold;"&gt;x&lt;/span&gt;[1]) &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#      plot&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################&lt;/span&gt;&lt;br /&gt;myRank &amp;lt;- which(lb[team_col] == myTeam)&lt;br /&gt;mytext = paste(&lt;span style="color: #2a00ff;"&gt;"Team"&lt;/span&gt;,myTeam,&lt;span style="color: #2a00ff;"&gt;"is in position"&lt;/span&gt;,myRank,&lt;span style="color: #2a00ff;"&gt;"out of"&lt;/span&gt;,nrow(lb))&lt;br /&gt;&lt;br /&gt;error_data &amp;lt;- lb[,4]&lt;br /&gt;&lt;br 
/&gt;plot(error_data[1:topX]&lt;br /&gt;    ,col = &lt;span style="color: #2a00ff;"&gt;'blue'&lt;/span&gt;&lt;br /&gt;    ,main = paste(title,&lt;span style="color: #2a00ff;"&gt;"- leaderboard at"&lt;/span&gt;,Sys.Date())&lt;br /&gt;    , &lt;span style="color: #7f0055; font-weight: bold;"&gt;sub&lt;/span&gt; = mytext&lt;br /&gt;    , xlab = &lt;span style="color: #2a00ff;"&gt;'Rank'&lt;/span&gt;&lt;br /&gt;    , ylab = &lt;span style="color: #2a00ff;"&gt;'Error'&lt;/span&gt;&lt;br /&gt;    )&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#mark my position&lt;/span&gt;&lt;br /&gt;points(myRank,error_data[myRank],col=&lt;span style="color: #2a00ff;"&gt;"red"&lt;/span&gt;,pch=19)&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# End of Function&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# plotKaggleLeaderboard&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;compURL &amp;lt;- &lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://www.heritagehealthprize.com/c/hhp/Leaderboard&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;br /&gt;compTeam &amp;lt;- &lt;span style="color: #2a00ff;"&gt;'Sali Mali'&lt;/span&gt;&lt;br /&gt;compTopX &amp;lt;- 125&lt;br /&gt;compTitle &amp;lt;- &lt;span style="color: #2a00ff;"&gt;'HHP'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;plotKaggleLeaderboard(theURL = compURL&lt;br /&gt;                      ,myTeam = compTeam&lt;br /&gt;                      ,topX = compTopX&lt;br /&gt;                      ,title = compTitle &lt;br /&gt;                      )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' 
height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-2495900062289005451?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/2495900062289005451/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/scraping-up-leaderboard.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/2495900062289005451'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/2495900062289005451'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/06/scraping-up-leaderboard.html' title='Scraping Up the Leaderboard'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-LTrEW3TiPx8/TewqD-yfyUI/AAAAAAAAADo/09FMjdJoP6Q/s72-c/tourism2leaderboard.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-6294483201120320037</id><published>2011-05-30T22:54:00.000-07:00</published><updated>2011-05-30T23:01:37.152-07:00</updated><title type='text'>There’s more than one way to skin a cat</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/minimise-what.html"&gt;In a previous post&lt;/a&gt; we&amp;nbsp;looked at the strange looking log function that we 
are supposed to minimise to win the $3 million HHP and showed that not all errors are equal under this metric.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;For example, a difference of 0.1 between your prediction and the actual value is more severely punished if the actual is 0 than if the actual is 15. The reason is that it is more important to know whether the customer will go to hospital at all than to know exactly how many days they will spend there.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Now this 'non-standard'&amp;nbsp;way the model is being judged&amp;nbsp;might cause issues if we are using standard algorithms that seek to minimise different error functions. For example, in linear regression, the algorithm seeks to find a solution that minimises the sum of the squared errors, with no regard for what the actual value&amp;nbsp;is:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;sqrt&lt;span style="color: blue;"&gt;( ( &lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;strong&gt;&lt;span style="color: red;"&gt;pred&lt;/span&gt; &lt;/strong&gt;- &lt;/span&gt;&lt;span style="color: #6aa84f; font-size: x-small;"&gt;&lt;strong&gt;act &lt;/strong&gt;&lt;/span&gt;&lt;span style="color: blue; font-size: x-small;"&gt;&lt;span style="color: blue; font-size: x-small;"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt; ^ &lt;/span&gt;&lt;span style="color: blue; font-size: x-small;"&gt;&lt;span style="color: blue; font-size: x-small;"&gt;2 )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div align="left" style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="color: black; 
font-family: Arial, Helvetica, sans-serif; font-size: small;"&gt;So, does this mean that we have to come up with a new algorithm that seeks to minimise this specific log function, or is there&amp;nbsp;anything else that can be done so that we don't have to reinvent the wheel?&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Now the error function we are asked to minimise is &lt;em&gt;very &lt;/em&gt;similar,&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;sqrt&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;span style="color: blue; font-size: x-small;"&gt;&lt;span style="color: blue; font-size: x-small;"&gt;(&amp;nbsp; (&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;span style="color: red;"&gt;&lt;strong&gt;&lt;span style="font-size: x-small;"&gt;log&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-size: x-small;"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;pred+&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-size: x-small;"&gt;1)&lt;/span&gt;&lt;/span&gt;&lt;/strong&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt; - &lt;span style="color: #6aa84f;"&gt;&lt;strong&gt;log&lt;/strong&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="color: #6aa84f;"&gt;&lt;strong&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-size: x-small;"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;act+&lt;/span&gt;&lt;/strong&gt;&lt;/span&gt;&lt;span style="color: blue; font-size: x-small;"&gt;&lt;span style="color: blue; font-size: x-small;"&gt;&lt;span style="color: #6aa84f;"&gt;&lt;strong&gt;1)&amp;nbsp; 
&lt;/strong&gt;&lt;/span&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt; ^ &lt;/span&gt;&lt;span style="color: blue; font-size: x-small;"&gt;&lt;span style="color: blue; font-size: x-small;"&gt;2&amp;nbsp; )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;Note that the fact we have taken the log(act+1) is what introduces the issue that the magnitude of the actual value plays a part. The plot below shows&amp;nbsp;x &lt;em&gt;v&lt;/em&gt; log(x+1). A small change in x will give different changes in log(x+1) depending on what x was in the first place (ie the gradient of the line).&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-lCgxeRmO8hE/TeR2l8ArtBI/AAAAAAAAADQ/OoUWBI5wIm0/s1600/logx.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://1.bp.blogspot.com/-lCgxeRmO8hE/TeR2l8ArtBI/AAAAAAAAADQ/OoUWBI5wIm0/s320/logx.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;So, what to do?&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;If we look at the two equations above we can see that they are the same if we substitute &lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: 
center;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;pred1 = log(pred + 1) and act1 = log(act + 1)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif; font-size: small;"&gt;So if, rather than predicting DIH (days in hospital), we predict log(DIH + 1), then we can use least squares minimisation and the standard algorithms that&amp;nbsp;have already been developed for us.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif; font-size: small;"&gt;What we need to remember, though, is that when we make a submission we have to put the prediction back on the original scale. 
&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;DIH_tran = log(DIH + 1)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;DIH + 1 = exp(DIH_tran)&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-family: Arial; font-size: small;"&gt;DIH = exp(DIH_tran) - 1&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;/div&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;If we plot the data with these adjustments we can see how the error surface shows&amp;nbsp;parallel lines, meaning all errors are equal, and least squares&amp;nbsp;minimisation will work OK.&amp;nbsp;When&amp;nbsp;we put the errors back on the original&amp;nbsp;scale, all errors are not equal.&amp;nbsp;&lt;/span&gt; &lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-on_tLtnGzeE/TeR_ZezST1I/AAAAAAAAADU/yoRnIbNizD4/s1600/adjusteddata.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-on_tLtnGzeE/TeR_ZezST1I/AAAAAAAAADU/yoRnIbNizD4/s320/adjusteddata.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://3.bp.blogspot.com/-kSvLDdVLkow/TeR_eQ_4juI/AAAAAAAAADY/UOp_0XMPA8c/s1600/unadjusteddata.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-kSvLDdVLkow/TeR_eQ_4juI/AAAAAAAAADY/UOp_0XMPA8c/s320/unadjusteddata.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;#############################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# NOW ADJUSTING THE TARGET SO WE CAN&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# MINIMISE THE RMSE&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#############################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# generate the data&lt;/span&gt;&lt;br /&gt;dih &amp;lt;- seq(from=0, to=15, by = 0.05)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#adjusted days in hospital&lt;/span&gt;&lt;br /&gt;dih1 &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(dih + 1)&lt;br /&gt;dat &amp;lt;- expand.grid(act1 = dih1, pred1 = dih1)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#adjusted error function&lt;/span&gt;&lt;br /&gt;dat$err1 &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;sqrt&lt;/span&gt;((dat$pred1 - dat$act1) ^ 2)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#'un-adjust' the days in 
hospital&lt;/span&gt;&lt;br /&gt;dat$act &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;exp&lt;/span&gt;(dat$act1) - 1 &lt;br /&gt;dat$pred &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;exp&lt;/span&gt;(dat$pred1) - 1&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#plot the error on the adjusted scale&lt;/span&gt;&lt;br /&gt;contourplot(err1 ~ act1 * pred1, data = dat&lt;br /&gt;          ,region = TRUE&lt;br /&gt;          ,cuts = 10&lt;br /&gt;          ,col.regions = terrain.colors&lt;br /&gt;          )&lt;br /&gt;    &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#plot the real scale&lt;/span&gt;&lt;br /&gt;contourplot(err1 ~ act * pred, data = dat&lt;br /&gt;          ,region = TRUE&lt;br /&gt;          ,cuts = 10&lt;br /&gt;          ,col.regions = terrain.colors&lt;br /&gt;          )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-6294483201120320037?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/6294483201120320037/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/theres-more-than-one-way-to-skin-cat.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6294483201120320037'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6294483201120320037'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/theres-more-than-one-way-to-skin-cat.html' title='There’s more than one way to skin a cat'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-lCgxeRmO8hE/TeR2l8ArtBI/AAAAAAAAADQ/OoUWBI5wIm0/s72-c/logx.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8526591369164804784</id><published>2011-05-29T00:43:00.000-07:00</published><updated>2011-05-29T02:58:07.688-07:00</updated><title type='text'>Hunting the Hidden Dimensions</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Last week I came across a documentary on TV called 'Hunting the Hidden Dimensions'. Now generally there is nothing worth watching on Australian TV (unless you are into cooking or talent contests) but this&amp;nbsp;one actually got me interested (although I must admit to falling asleep half way through due to the late for me time slot).&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sbs.com.au/documentary/program/huntingthehiddendimensions/index"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;http://www.sbs.com.au/documentary/program/huntingthehiddendimensions/index&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The program was about fractals and the hidden order that appears in everything in nature. 
I didn't realise that fractal geometry was involved in film animation to generate mountain ranges and volcanic lava flow!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;One thing that did strike a chord was how Mandelbrot, when he first published his work, got a bit of a thumbs down from the academic mainstream because the ideas didn't conform to the nice&amp;nbsp;orderly nature of the world people were used to (ie the type of mathematics the Egyptians used to build the pyramids). The thing is, fractals are perfectly orderly, but looked at in a way that was different to the norm.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;This, in a data mining context,&amp;nbsp;is similar to neural networks, with many&amp;nbsp;people who should know better saying they are uninterpretable black boxes.&amp;nbsp;When neural networks give improved predictions, they have found 'hidden&amp;nbsp;dimensions' that are very easy&amp;nbsp;to find if you are willing to look. 
Like Mandelbrot, you will need to&amp;nbsp;look at data&amp;nbsp;visually&amp;nbsp;rather than mathematically in order to figure things out.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;You can watch the documentary online.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8526591369164804784?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8526591369164804784/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/hubting-hidden-dimensions.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8526591369164804784'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8526591369164804784'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/hubting-hidden-dimensions.html' title='Hunting the Hidden Dimensions'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-6577597193095458454</id><published>2011-05-21T22:08:00.000-07:00</published><updated>2011-05-29T03:16:57.857-07:00</updated><title type='text'>Minimise the what?</title><content 
type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The other week I was on Amazon to purchase &lt;a href="http://www.liaad.up.pt/~ltorgo/DataMiningWithR/"&gt;Data Mining with R,&lt;/a&gt; and Amazon did some cross-selling, telling me I might be interested in &lt;a href="http://had.co.nz/ggplot2/book/"&gt;ggplot2&lt;/a&gt;. I remembered &lt;a href="http://www.r-bloggers.com/getting-into-shape-for-the-sport-of-data-science-screencast-of-talk-by-jeremy-howard-at-melbourne-r-users/"&gt;listening to a talk by Jeremy Howard&lt;/a&gt;, and seemed to recall he mentioned this book was worth a read, so I ticked the appropriate box, and now the books are here.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I prefer looking at pictures to interpret information, so I wanted to see if I could use my new book to figure out what the log function that the HHP people want us to minimise is all about. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The following plots are all variations of the same data, plotted with different packages and functions. The generated data is basically what the errors&amp;nbsp;would be&amp;nbsp;for all possible combinations of Actual&amp;nbsp;Days In Hospital&amp;nbsp;v Predicted Days in Hospital.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Now generally, if you predict 15 and the real answer is 14, that is considered just as good as if you predict 2 and the answer is 1 - both have the same error of 1. Many methods of determining what is a good prediction, such as the mean squared error (MSE), don't care about the magnitude of the prediction, just the magnitude of the error.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Other metrics such as mean absolute percentage error (MAPE) do consider the magnitude. 
In electric load forecasting, if the actual load is 2 megawatts (MW) and you&amp;nbsp;have forecast&amp;nbsp;1 MW, it can be far more serious than if the actual load is 900 MW and you have forecast 899 MW. In both cases the error is only 1 MW but the consequences can be quite different.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;So what does the error function of the HHP tell us?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Basically it says that we are penalised less for the same absolute error the higher the actual Days in Hospital is. So rather than concentrate on improving the prediction for someone who spent 15 days in hospital when we predicted only 14, we would get far more reward by putting the effort into improving the prediction for someone who spent 2 days in hospital when we predicted only 1. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;or...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;a small improvement in accuracy for a small Days in Hospital is worth as much as a large improvement in accuracy for a large Days in Hospital.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Now all this insight was gleaned from just creating a graph. 
No Maths!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-F1XG5YDnGYs/TdiP-3LbKcI/AAAAAAAAACw/gJopnJ_Xxc4/s1600/errorsurfaceA.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://1.bp.blogspot.com/-F1XG5YDnGYs/TdiP-3LbKcI/AAAAAAAAACw/gJopnJ_Xxc4/s320/errorsurfaceA.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-py5XfiCUSTY/TdiQES2YIrI/AAAAAAAAAC0/qWQ8unpaZUI/s1600/errorsurfaceB.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://2.bp.blogspot.com/-py5XfiCUSTY/TdiQES2YIrI/AAAAAAAAAC0/qWQ8unpaZUI/s320/errorsurfaceB.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-tyD7sPmWpn4/TdiQKyISbfI/AAAAAAAAAC4/6rhum5oZQ0E/s1600/errorsurfaceC.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://4.bp.blogspot.com/-tyD7sPmWpn4/TdiQKyISbfI/AAAAAAAAAC4/6rhum5oZQ0E/s320/errorsurfaceC.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-atR0gYH9Vww/TdiQPhqAzqI/AAAAAAAAAC8/lymgdFClhQU/s1600/errorsurfaceD.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://4.bp.blogspot.com/-atR0gYH9Vww/TdiQPhqAzqI/AAAAAAAAAC8/lymgdFClhQU/s320/errorsurfaceD.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://2.bp.blogspot.com/-3EJ39XVK2Q0/TdiQUdfi7zI/AAAAAAAAADA/AR2wwXjoXoA/s1600/errorsurfaceE.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://2.bp.blogspot.com/-3EJ39XVK2Q0/TdiQUdfi7zI/AAAAAAAAADA/AR2wwXjoXoA/s320/errorsurfaceE.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;Now if we zoom in to what is the business end (&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/learning-from-leaderboard-part-1.html"&gt;remember 0.2 is a good guess&lt;/a&gt;) we see things are a bit more linear...&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-3rxXZ7yH0dU/TeIcckWyFRI/AAAAAAAAADM/VoxPdmgrekQ/s1600/errorsurfaceZoomed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-3rxXZ7yH0dU/TeIcckWyFRI/AAAAAAAAADM/VoxPdmgrekQ/s320/errorsurfaceZoomed.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&amp;nbsp;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;And here is what we get for a normal type of error (just strip out the log and +1 from the data generation function). 
Note the lines are parallel.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-FUqkwVQ-Fyg/TdmUu4aBcQI/AAAAAAAAADE/s_yFc8OhVYI/s1600/errorsurfaceNormal.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="176" j8="true" src="http://2.bp.blogspot.com/-FUqkwVQ-Fyg/TdmUu4aBcQI/AAAAAAAAADE/s_yFc8OhVYI/s320/errorsurfaceNormal.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;And for percentage errors, note that the lines converge, as with the log error, but the worst position to be in is making high predictions when the actual is low.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-rS6meUCSYEk/TdmX7JsZkMI/AAAAAAAAADI/Yg1C_pKLJnk/s1600/errorsurfaceMAPE.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="176" j8="true" src="http://1.bp.blogspot.com/-rS6meUCSYEk/TdmX7JsZkMI/AAAAAAAAADI/Yg1C_pKLJnk/s320/errorsurfaceMAPE.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br 
/&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Here is the R code to generate the plots...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;#############################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# generate the data&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#############################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;dih &amp;lt;- seq(from=0, to=15, by = 0.2)&lt;br /&gt;dat &amp;lt;- expand.grid(act = dih, pred = dih)&lt;br /&gt;dat$err &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;sqrt&lt;/span&gt;((&lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(dat$pred+1) - &lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(dat$act+1)) ^ 2)&lt;br /&gt;dat1 &amp;lt;- matrix(&lt;span style="color: #7f0055; font-weight: bold;"&gt;sqrt&lt;/span&gt;((&lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(dat$act+1) - &lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(dat$pred+1)) ^ 2),&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;(dih))&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# now plot the error surface of act v pred&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# using several methods&lt;/span&gt;&lt;br 
/&gt;&lt;span style="color: #3f7f59;"&gt;###########################################&lt;/span&gt;&lt;br /&gt;library(ggplot2)&lt;br /&gt;library(lattice)&lt;br /&gt;library(vcd)&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;require&lt;/span&gt;(grDevices) &lt;span style="color: #3f7f59;"&gt;# for colours&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;### using ggplot2 ###&lt;/span&gt;&lt;br /&gt;df &amp;lt;- data.frame(dat)&lt;br /&gt;names(df) &amp;lt;- c(&lt;span style="color: #2a00ff;"&gt;"actual"&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;"predicted"&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;"error"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#version1&lt;/span&gt;&lt;br /&gt;errplot &amp;lt;- ggplot(df, aes(actual,predicted, fill=error))&lt;br /&gt;errplot &amp;lt;- errplot + geom_tile()&lt;br /&gt;errplot &amp;lt;- errplot + scale_x_continuous(expand = c(0,0)) &lt;br /&gt;errplot &amp;lt;- errplot + scale_y_continuous(expand = c(0,0))&lt;br /&gt;errplot&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#version2&lt;/span&gt;&lt;br /&gt;fill_gradn &amp;lt;- function(pal) {&lt;br /&gt;  scale_fill_gradientn(colours = pal(7),limits = c(0,.5))&lt;br /&gt;}&lt;br /&gt;errplot &amp;lt;- errplot + fill_gradn(rainbow_hcl)&lt;br /&gt;errplot&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;### using wireframe ###&lt;/span&gt;&lt;br /&gt;wireframe(err ~ act * pred, data = dat&lt;br /&gt;          ,scales = list(arrows = FALSE)&lt;br /&gt;          ,drape = TRUE&lt;br /&gt;          ,colorkey = TRUE,&lt;br /&gt;          screen = list(z = 30, &lt;span style="color: #7f0055; font-weight: bold;"&gt;x&lt;/span&gt; = -60))&lt;br /&gt;    &lt;br /&gt;contourplot(err ~ act * pred, data = dat&lt;br /&gt;          ,region = TRUE&lt;br /&gt;          ,cuts = 10&lt;br /&gt;          ,col.regions = terrain.colors&lt;br /&gt;          )&lt;br /&gt;    
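&lt;span style="color: #3f7f59;"&gt;## illustrative check (an added sketch, not one of the original plots):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;## the same 1-day miss costs more when the actual DIH is small.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;## hhp_err is a hypothetical helper name; it equals the sqrt(( )^2) form used for dat$err&lt;/span&gt;&lt;br /&gt;hhp_err &amp;lt;- function(pred, act) abs(log(pred + 1) - log(act + 1))&lt;br /&gt;hhp_err(pred = 14, act = 15)  &lt;span style="color: #3f7f59;"&gt;# roughly 0.065&lt;/span&gt;&lt;br /&gt;hhp_err(pred = 1, act = 2)    &lt;span style="color: #3f7f59;"&gt;# roughly 0.405&lt;/span&gt;&lt;br /&gt;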
&lt;br /&gt;    &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;## filled contour plot&lt;/span&gt;&lt;br /&gt;filled.contour(&lt;span style="color: #7f0055; font-weight: bold;"&gt;x&lt;/span&gt; = dih, &lt;span style="color: #7f0055; font-weight: bold;"&gt;y &lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;=&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt; dih, z &lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;=&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt; dat1&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #2a00ff;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;,nlevels&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;=&lt;/span&gt;10&lt;br /&gt;                ,color = terrain.colors&lt;br /&gt;                ,plot.title = title(main = &lt;span style="color: #2a00ff;"&gt;"HHP Error Function (the funny log one!)"&lt;/span&gt;,&lt;br /&gt;                xlab =&lt;span style="color: #2a00ff;"&gt;'actual DIH'&lt;/span&gt;, ylab = &lt;span style="color: #2a00ff;"&gt;"predicted DIH"&lt;/span&gt;)&lt;br /&gt;                ,key.title = title(main=&lt;span style="color: #2a00ff;"&gt;"Error"&lt;/span&gt;),&lt;br /&gt;                )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-6577597193095458454?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/6577597193095458454/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/minimise-what.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6577597193095458454'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6577597193095458454'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/minimise-what.html' title='Minimise the what?'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-F1XG5YDnGYs/TdiP-3LbKcI/AAAAAAAAACw/gJopnJ_Xxc4/s72-c/errorsurfaceA.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7509892185403002921</id><published>2011-05-18T13:13:00.000-07:00</published><updated>2011-05-18T14:05:28.170-07:00</updated><title type='text'>State of the Union</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial;"&gt;80%+ of a data miner's time is normally spent manipulating the data to get to the point where they can start extracting insights. This data preparation is actually the most important part of the process, and the part&amp;nbsp;where you should be most diligent.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Here is another example (to add to &lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/beware-integers-of-sql.html"&gt;the one in the previous post&lt;/a&gt;) that demonstrates you don't always get what you think you should get.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;We have been supplied with two data sets of names and want to combine them into a single table. 
The &lt;span style="color: blue;"&gt;&lt;strong&gt;UNION&lt;/strong&gt;&lt;/span&gt; command will oblige us,&amp;nbsp;but we have got to be very careful how we pose the question...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;(This is MS SQLServer)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-NfYdKTbvGLo/TdQkj6geVwI/AAAAAAAAACY/3eZw2DJ3FzU/s1600/names1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" j8="true" src="http://3.bp.blogspot.com/-NfYdKTbvGLo/TdQkj6geVwI/AAAAAAAAACY/3eZw2DJ3FzU/s1600/names1.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-7H-I2242PxM/TdQkn3gIXNI/AAAAAAAAACc/oQu-Iu2Ql5M/s1600/names2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" j8="true" src="http://2.bp.blogspot.com/-7H-I2242PxM/TdQkn3gIXNI/AAAAAAAAACc/oQu-Iu2Ql5M/s1600/names2.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-Rjmybvz3vbM/TdQktFtQMII/AAAAAAAAACg/VtbeKsTgPzU/s1600/namesunion.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" j8="true" src="http://4.bp.blogspot.com/-Rjmybvz3vbM/TdQktFtQMII/AAAAAAAAACg/VtbeKsTgPzU/s1600/namesunion.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;The result of this is not what we might think. 
UNION matches columns by position and ignores the field names, so our first names and surnames are now all muddled up because they were in different column positions in each source table.&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;So to get what we really wanted, we need to get the columns aligned first...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-QyiKZaKzBoc/TdQkw853U2I/AAAAAAAAACk/pyIsbcdrZT4/s1600/namescorrect.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="275" j8="true" src="http://3.bp.blogspot.com/-QyiKZaKzBoc/TdQkw853U2I/AAAAAAAAACk/pyIsbcdrZT4/s320/namescorrect.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Don't assume the source data is consistent or that the fields are in the same order in the database.&amp;nbsp;Fortunately, if you have lots of fields,&amp;nbsp;there is a way to check this...&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-aNAlCT2E7kc/TdQwcco_xxI/AAAAAAAAACo/1s_e5RWHq4I/s1600/checkingcolumnorder.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://4.bp.blogspot.com/-aNAlCT2E7kc/TdQwcco_xxI/AAAAAAAAACo/1s_e5RWHq4I/s320/checkingcolumnorder.PNG" width="251" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: 
left;"&gt;&lt;span style="font-family: Arial;"&gt;If the field orders are not consistent, then it can be a pain explicitly typing out all the fields in your SQL statement to get the orders to match. Now there is also a time saver for this...&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-rfOZNhvdpuo/TdQynCl5p6I/AAAAAAAAACs/H0xfN7uqx90/s1600/fieldnameswithcommas.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" j8="true" src="http://3.bp.blogspot.com/-rfOZNhvdpuo/TdQynCl5p6I/AAAAAAAAACs/H0xfN7uqx90/s1600/fieldnameswithcommas.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;We can then just copy the result and paste it into our SQL to save us&amp;nbsp;typing.&lt;/span&gt;&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7509892185403002921?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7509892185403002921/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/state-of-union.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7509892185403002921'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7509892185403002921'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/state-of-union.html' title='State of the Union'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-NfYdKTbvGLo/TdQkj6geVwI/AAAAAAAAACY/3eZw2DJ3FzU/s72-c/names1.PNG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-3902181424911432820</id><published>2011-05-18T03:52:00.000-07:00</published><updated>2011-05-18T03:54:57.762-07:00</updated><title type='text'>Beware the Integers of SQL</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Most data miners improve by learning from their mistakes. 
Many software programs have their own little quirks and sometimes deliver&amp;nbsp;answers that are not quite the answers to the question you thought you were asking...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 300px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;AVG&lt;/span&gt;(DaysInHospital) &lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; DaysInHospital_Y2&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- 0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;AVG&lt;/span&gt;(DaysInHospital * 1.0) &lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; DaysInHospital_Y2&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- 0.467101&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;SUM&lt;/span&gt;(DaysInHospital) / &lt;span style="color: #7f0055; font-weight: bold;"&gt;COUNT&lt;/span&gt;(*) &lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; DaysInHospital_Y2&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;-- 0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;select&lt;/span&gt; &lt;span style="color: #7f0055; font-weight: bold;"&gt;SUM&lt;/span&gt;(DaysInHospital) * 1.0 / &lt;span style="color: #7f0055; font-weight: bold;"&gt;COUNT&lt;/span&gt;(*) &lt;span style="color: #7f0055; font-weight: bold;"&gt;from&lt;/span&gt; DaysInHospital_Y2&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;-- 0.467101542670&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-3902181424911432820?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/3902181424911432820/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/beware-integers-of-sql.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3902181424911432820'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3902181424911432820'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/beware-integers-of-sql.html' title='Beware the Integers of SQL'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-8831346045278898052</id><published>2011-05-16T07:40:00.000-07:00</published><updated>2011-05-16T07:46:33.762-07:00</updated><title type='text'>Learning from the Leaderboard - Part 1</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;It is now day 12 of the HHP. 
I know this as I have been trying to make a submission every day, and the leaderboard says Sali Mali has made 12 submissions.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Dave is keeping his position near the top but Sali Mali is going downhill, as others overtake me.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I've now started to look at the problem in a little more detail, and decided I needed a way of replicating the error function that is being used to judge this competition. This will help&amp;nbsp;in figuring out what sort of leaderboard score I expect my submissions to achieve.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I've also not been&amp;nbsp;totally wasting my submissions&amp;nbsp;during the time when&amp;nbsp;I have had no useful model to submit. In previous time series competitions, it&amp;nbsp;has been&amp;nbsp;possible to learn a lot from the leaderboard (real examples&amp;nbsp;will be given in part 2 of this post), so I figured pinging a few constant-valued submissions might help me extract some insight about the future (although I've not really thought about what that insight might be at the moment).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The following R code demonstrates &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;&lt;span style="font-family: Arial;"&gt;how the function to calculate the error was put together&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial;"&gt;how we can easily generate the errors for a sequence of constant-valued predictions on Y2 and Y3&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial;"&gt;how the Y4 leaderboard results were read from an Excel file and compared to the Y2 and Y3 results.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Doing this was a sanity check 
that my error function gave similar results to the leaderboard calculations. It shows that a constant of 0.2 is a good guess and also that the Y4 curve looks more like Y2 than Y3. This is interesting and needs further thought on exactly what this means, if anything!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;###########################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#function to calculate the model error&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################################&lt;/span&gt;&lt;br /&gt;calc_error &amp;lt;- function(act,pred)&lt;br /&gt;{&lt;br /&gt;    aact &amp;lt;- as.matrix(act)&lt;br /&gt;    ppred &amp;lt;- as.matrix(pred)&lt;br /&gt;    &lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(nrow(aact) == nrow(ppred)){&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;return&lt;/span&gt; (&lt;span style="color: #7f0055; font-weight: bold;"&gt;sqrt&lt;/span&gt;(colSums((&lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(ppred+1) - &lt;span style="color: #7f0055; font-weight: bold;"&gt;log&lt;/span&gt;(aact+1)) ^ 2) / nrow(aact)))&lt;br /&gt;    } &lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt; {&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;return&lt;/span&gt; (-99)&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#### EOF to calculate model error ####&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;###########################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# function to calculate the model errors&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# given a sequence of constant values for&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# the predictions and the known outcome&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###########################################&lt;/span&gt;&lt;br /&gt;err_seq &amp;lt;- function(sequence,act)&lt;br /&gt;{&lt;br /&gt;    &lt;br /&gt;    actual &amp;lt;- as.matrix(act)&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#vector of errors&lt;/span&gt;&lt;br /&gt;    errors &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;(sequence))&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #3f7f59;"&gt;#vector for the predictions (will be constants)&lt;/span&gt;&lt;br /&gt;    predictions &amp;lt;- vector(&lt;span style="color: #7f0055; font-weight: bold;"&gt;length&lt;/span&gt;=nrow(actual))&lt;br /&gt;&lt;br /&gt;        ind &amp;lt;- 0&lt;br /&gt;        &lt;span style="color: #7f0055; font-weight: bold;"&gt;for&lt;/span&gt;(i in sequence){&lt;br /&gt;            &lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#report progress&lt;/span&gt;&lt;br /&gt;            ind = ind + 1&lt;br /&gt;            cat(&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\n&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;,i)&lt;br /&gt;            flush.console()&lt;br /&gt;            &lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#set the prediction to the constant     &lt;/span&gt;&lt;br /&gt;            predictions[] &amp;lt;- i    &lt;br /&gt;            &lt;span style="color: #3f7f59;"&gt;#calculate the error&lt;/span&gt;&lt;br /&gt;            errors[ind] 
&amp;lt;- calc_error(act=actual,pred=predictions)&lt;br /&gt;        }&lt;br /&gt;&lt;br /&gt;    &lt;span style="color: #7f0055; font-weight: bold;"&gt;return&lt;/span&gt; (errors)&lt;br /&gt;}&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;###EOF to calculate sequence of errors ####&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#connect to data&lt;/span&gt;&lt;br /&gt;library(RODBC) &lt;span style="color: #3f7f59;"&gt;#for data connection&lt;/span&gt;&lt;br /&gt;conn &amp;lt;- odbcConnect(&lt;span style="color: #2a00ff;"&gt;"sql_server_HHP"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#load the actual days in hospital&lt;/span&gt;&lt;br /&gt;actualY2 &amp;lt;- sqlQuery(conn,&lt;span style="color: #2a00ff;"&gt;"select DaysInHospital from DaysInHospital_Y2"&lt;/span&gt;)&lt;br /&gt;actualY3 &amp;lt;- sqlQuery(conn,&lt;span style="color: #2a00ff;"&gt;"select DaysInHospital from DaysInHospital_Y3"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#set up a sequence of constants to be used as the predictions&lt;/span&gt;&lt;br /&gt;const_preds &amp;lt;- seq(from=0, to=1, by=0.01)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#calculate the error sequences for Y2 &amp;amp; Y3&lt;/span&gt;&lt;br /&gt;Y2Err &amp;lt;- err_seq(sequence=const_preds,act=actualY2)&lt;br /&gt;Y3Err &amp;lt;- err_seq(sequence=const_preds,act=actualY3)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#read in the errors for Y4 from an excel file &lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#where they were recorded&lt;/span&gt;&lt;br /&gt;library(xlsReadWrite)&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#this is also required to be run&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#see &lt;/span&gt;&lt;span style="color: #3f3fbf;"&gt;http://www.swissr.org/software/xlsreadwrite&lt;/span&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;#xls.getshlib() &lt;/span&gt;&lt;br /&gt;Y4 &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;read&lt;/span&gt;.xls(&lt;span style="color: #2a00ff;"&gt;"E:&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;comps&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;hhp&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;constants.xls"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#############################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#plot prediction v error&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#############################&lt;/span&gt;&lt;br /&gt;ymin &amp;lt;- min(Y2Err,Y3Err)&lt;br /&gt;ymax &amp;lt;- max(Y2Err,Y3Err)&lt;br /&gt;plot(const_preds,Y2Err&lt;br /&gt;    ,type=&lt;span style="color: #2a00ff;"&gt;'l'&lt;/span&gt;&lt;br /&gt;    ,main=&lt;span style="color: #2a00ff;"&gt;'Learning from the Leaderboard'&lt;/span&gt;&lt;br /&gt;    ,xlab=&lt;span style="color: #2a00ff;"&gt;'Predicted Constant'&lt;/span&gt;&lt;br /&gt;    ,ylab=&lt;span style="color: #2a00ff;"&gt;'Error'&lt;/span&gt;&lt;br /&gt;    ,ylim=c(ymin, ymax)&lt;br /&gt;    ,col=&lt;span style="color: #2a00ff;"&gt;'blue'&lt;/span&gt;&lt;br /&gt;    )&lt;br /&gt;lines(const_preds,Y3Err,col=&lt;span style="color: #2a00ff;"&gt;'green'&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#add the line for Y4, discovered via leaderboard&lt;/span&gt;&lt;br /&gt;lines(Y4[,1],Y4[,2],col=&lt;span style="color: #2a00ff;"&gt;'red'&lt;/span&gt;,lwd=2)&lt;br /&gt;&lt;br /&gt;legend(&lt;span style="color: #2a00ff;"&gt;'bottomright'&lt;/span&gt;, c(&lt;span style="color: #2a00ff;"&gt;'Y2 - known'&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;'Y3 - known'&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;'Y4 - via 
leaderboard'&lt;/span&gt;),lty=1, col=c(&lt;span style="color: #2a00ff;"&gt;"blue"&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;"green"&lt;/span&gt;,&lt;span style="color: #2a00ff;"&gt;"red"&lt;/span&gt;))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-6P7mli_Rg78/TdEw8JWoliI/AAAAAAAAACU/zzK68npGsBw/s1600/learing+from+the+leaderboard.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://4.bp.blogspot.com/-6P7mli_Rg78/TdEw8JWoliI/AAAAAAAAACU/zzK68npGsBw/s320/learing+from+the+leaderboard.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-8831346045278898052?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/8831346045278898052/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/learning-from-leaderboard-part-1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8831346045278898052'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/8831346045278898052'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/learning-from-leaderboard-part-1.html' title='Learning from the Leaderboard - Part 1'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-6P7mli_Rg78/TdEw8JWoliI/AAAAAAAAACU/zzK68npGsBw/s72-c/learing+from+the+leaderboard.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-3816573175487585683</id><published>2011-05-13T16:06:00.000-07:00</published><updated>2011-05-14T21:36:58.448-07:00</updated><title type='text'>Connecting R to our Database and Plotting Histograms</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;So far with the HHP I have&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;loaded the data into a database&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;written an R script to check the submission file is correct, with the added bonus of plotting a distribution for a sanity check&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;As yet I haven't even really looked at the data and don't really know what the problem is we are being asked to solve. But not to worry, all that will hopefully fall into place once we have our tools all set up and talking to each other.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Data lives in a database, but it is no good to anyone if we can't get&amp;nbsp;at it to analyse. It is possible to access this data, as the database developers have very generously provided what are called database drivers that allow other applications to get to the data, and even change it if so desired. 
Many analytic applications have provided the means to utilise these drivers so the data can be sucked into their tools for analysis.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The task of connecting your applications to databases can be very painful if you are not really a database person and don't understand the lingo of what you are being asked to do. Some applications make it very easy (I have found &lt;a href="http://www.tableausoftware.com/"&gt;Tableau&lt;/a&gt; remarkably painless and my own software &lt;a href="http://www.tiberius.biz/helpfiles/loadingdata.html"&gt;Tiberius has instructions on how to do this&lt;/a&gt;) and for some you have to Google around a lot to find the answers to your questions. It is amazing how many products in their documentation just say 'contact your&amp;nbsp;system administrator' - which is no help if it is you.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;In the end though, once you have connected, the biggest hurdle is over, and then the world is your oyster.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;So, we now want to connect SQL Server to R. First the code, then the detail.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;i&gt;Note: If you are viewing this in Internet Explorer or Chrome, you&amp;nbsp;might get text wrapping in the code below and no horizontal scroll bar. If you are using Firefox then you get the scroll bar and no text wrapping. Either way you can still copy and paste the code OK. 
&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html"&gt;Another Damn Computer!&amp;nbsp;&lt;/a&gt;&lt;/i&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;# load the required libraries&lt;/span&gt;&lt;br /&gt;library(RODBC) &lt;span style="color: #3f7f59;"&gt;#for data connection&lt;/span&gt;&lt;br /&gt;library(lattice) &lt;span style="color: #3f7f59;"&gt;#for histograms&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#######################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# set up a connection to the database&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#######################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#method 1 - using a connection string&lt;/span&gt;&lt;br /&gt;conn &amp;lt;- odbcDriverConnect(&lt;span style="color: #2a00ff;"&gt;"driver=SQL Server;database=HHP;server=PHIL-PC&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;SQLEXPRESS;"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#method 2 - involves setting up a DSN (Data Source Name)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#conn &amp;lt;- odbcConnect("sql_server_HHP")&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;########################################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# extract the data using this connection and some SQL&lt;/span&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;########################################################&lt;/span&gt;&lt;br /&gt;mydata &amp;lt;- sqlQuery(conn,&lt;span style="color: #2a00ff;"&gt;"select  * from DaysInHospital_Y2"&lt;/span&gt;)&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#mydata &amp;lt;- sqlQuery(conn,"select  * from Claims")&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;############################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# take a look at the data&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;############################&lt;/span&gt;&lt;br /&gt;colnames(mydata)&lt;br /&gt;head(mydata)&lt;br /&gt;summary(mydata)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;############################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# plot the distributions&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;############################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#set up the plot layout&lt;/span&gt;&lt;br /&gt;maxplots &amp;lt;- 9 &lt;span style="color: #3f7f59;"&gt;#upper limit on plots&lt;/span&gt;&lt;br /&gt;plots &amp;lt;- NCOL(mydata)&lt;br /&gt;plots &amp;lt;- min(plots,maxplots)&lt;br /&gt;sideA &amp;lt;- ceiling(&lt;span style="color: #7f0055; font-weight: bold;"&gt;sqrt&lt;/span&gt;(plots))&lt;br /&gt;sideB &amp;lt;- ceiling(plots/sideA)&lt;br /&gt;cells &amp;lt;- sideA * sideB&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#par(mfrow=c(sideA,sideB)) #use this for standard graphics&lt;/span&gt;&lt;br /&gt;plotposition &amp;lt;- matrix(1:cells,sideA,sideB,byrow = T)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#draw all the plots&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;for&lt;/span&gt;(i in 1:plots){&lt;br /&gt;myplot &amp;lt;- histogram(mydata[,i],main=colnames(mydata[i]),xlab=&lt;span style="color: #2a00ff;"&gt;''&lt;/span&gt;)&lt;br 
/&gt;myposition &amp;lt;- rev(which(plotposition == i, arr.ind=TRUE))&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;print&lt;/span&gt;(myplot, &lt;span style="color: #7f0055; font-weight: bold;"&gt;split&lt;/span&gt;=c(myposition,sideA,sideB), more=TRUE)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The code above is R code and is run from R. It assumes you have the library RODBC installed. If you don't, then just type&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="color: blue; font-family: Arial, Helvetica, sans-serif;"&gt;&lt;i&gt;install.packages('RODBC')&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;So, how do you get it to work?&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;There are 2 options to connect to the database, methods 1 &amp;amp; 2.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;In &lt;span style="color: red;"&gt;&lt;b&gt;method 1&lt;/b&gt;&lt;/span&gt;, you need to know 3 bits of information (you also might need user IDs and passwords depending on how access to the database is set up),&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-family: Arial;"&gt;the&amp;nbsp;driver name (&lt;b&gt;SQL Server&lt;/b&gt; in our case)&lt;/span&gt;&lt;/li&gt;&lt;li style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;the database server name. This can be tricky to remember. 
For SQL Server, the easiest place to find it is when you start up SQL Server Management Studio. In the picture below you will see the server name is PHIL-PC\SQLEXPRESS.&lt;/span&gt;&lt;a href="http://3.bp.blogspot.com/-f1YlH8QeSFo/Tc28J49umAI/AAAAAAAAABw/KdqhG5NnOU8/s1600/servername.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="236" j8="true" src="http://3.bp.blogspot.com/-f1YlH8QeSFo/Tc28J49umAI/AAAAAAAAABw/KdqhG5NnOU8/s320/servername.PNG" width="320" /&gt;&lt;/a&gt; &lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;br /&gt;There is one &lt;i&gt;gotcha&lt;/i&gt; here. You will notice in the R script we have an extra '\'&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;u&gt;server=PHIL-PC\\SQLEXPRESS&lt;/u&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;This is because '\' is a special character (the escape character) in R strings, so you need to put&amp;nbsp;two of them together to tell it that&amp;nbsp;you really do mean '\'.&amp;nbsp;This is also useful to know when you have file paths etc.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Another thing to note above: we are in Windows Authentication mode, which means we won't have to worry about passwords.&lt;/span&gt;&lt;/div&gt;&lt;div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&amp;nbsp;&amp;nbsp;&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;&lt;li style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&lt;div style="border-bottom: medium none; border-left: medium 
none; border-right: medium none; border-top: medium none; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;the name of the database on the database server that you want to connect to. You can see the available databases from within Management Studio. Here we have &lt;i&gt;HHP&lt;/i&gt; and &lt;i&gt;test&lt;/i&gt;.&lt;/span&gt;&lt;/div&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-JXFdFBsiovk/Tc2_XxkhVQI/AAAAAAAAAB0/4wIYYhT571A/s1600/databases.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" j8="true" src="http://4.bp.blogspot.com/-JXFdFBsiovk/Tc2_XxkhVQI/AAAAAAAAAB0/4wIYYhT571A/s1600/databases.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;So when we string all this together we get:&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="color: blue; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"&gt;conn &amp;lt;- odbcDriverConnect("driver=SQL Server;database=HHP;server=PHIL-PC\\SQLEXPRESS;")&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" 
style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="color: red; font-family: Arial, Helvetica, sans-serif;"&gt;Method 2&lt;span style="color: black;"&gt; involves creating a &lt;a href="http://en.wikipedia.org/wiki/Database_Source_Name"&gt;Data Source Name (DSN),&lt;/a&gt; which can be thought of as the above bits of information but stored away somewhere in your computer, just wrapped under a name. With many applications, you just then have to specify this name to connect to the data.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;On Windows machines, a DSN is created via a wizard that takes you through the steps. 
&lt;a href="http://support.microsoft.com/kb/305599"&gt;Here are instructions for XP&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;The one gotcha&amp;nbsp;that I commonly encounter with SQL Server is that the wizard tries to populate a list of available servers so you can just select yours. Sometimes you can be waiting forever, and you are not guaranteed to find your server in the list.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-PFH5ByKFGwU/Tc4KRCtfO-I/AAAAAAAAAB4/n1_XxZcvSoI/s1600/dsnwizard.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="222" j8="true" src="http://1.bp.blogspot.com/-PFH5ByKFGwU/Tc4KRCtfO-I/AAAAAAAAAB4/n1_XxZcvSoI/s320/dsnwizard.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;You can see above that it lists 
PHIL-PC,&amp;nbsp;but this will fail; it needs to be PHIL-PC\SQLEXPRESS.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;Also, if you are connected to a network, the list of servers could be pretty big and it could be difficult to remember which one you need. If you do this often, it can be worthwhile keeping your common server names in a list and just pasting them in.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;The wizard&amp;nbsp;can also test the connection for you, so you know if your settings are correct.&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The end result of all this is that we can connect R to our database and plot a graph. We can now get a sense of what the data is all about.&amp;nbsp;Progress!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Interesting observations - Days in Hospital is mainly 0, then 1-5, with another hardly visible blip at 15. 
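&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;A frequency table can make a small blip like the one at 15 easier to see than a histogram does. The lines below are only a rough sketch: the &lt;i&gt;days&lt;/i&gt; vector holds made-up values standing in for mydata$DaysInHospital (it is not the HHP data), so the counts it prints mean nothing in themselves.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: #ffffff; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;# made-up values standing in for mydata$DaysInHospital (NOT the real HHP data)&lt;/span&gt;&lt;br /&gt;days &amp;lt;- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 15, 15)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# count how many records fall on each value&lt;/span&gt;&lt;br /&gt;counts &amp;lt;- table(days)&lt;br /&gt;print(counts)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# the same counts as proportions of all records&lt;/span&gt;&lt;br /&gt;print(round(prop.table(counts), 3))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;With the real table loaded, swapping &lt;i&gt;days&lt;/i&gt; for mydata$DaysInHospital should give the actual counts. 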
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-VaftxzwSTYM/Tc9WSUO39sI/AAAAAAAAACM/9MHNFdSEfsg/s1600/histograms_daysinhospital.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="221" j8="true" src="http://2.bp.blogspot.com/-VaftxzwSTYM/Tc9WSUO39sI/AAAAAAAAACM/9MHNFdSEfsg/s320/histograms_daysinhospital.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-DobPEC-lcDo/Tc4PwOAdHkI/AAAAAAAAACA/iyqqsR8trxU/s1600/secondhistogram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://3.bp.blogspot.com/-DobPEC-lcDo/Tc4PwOAdHkI/AAAAAAAAACA/iyqqsR8trxU/s320/secondhistogram.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="color: red; font-family: Arial, Helvetica, sans-serif;"&gt;hist(mydata$DaysInHospital[mydata$DaysInHospital &amp;gt; 1])&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-e72gCECKpNI/Tc4SNwBs6EI/AAAAAAAAACE/am_sbLGhyt4/s1600/hist_days_gt_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" j8="true" src="http://3.bp.blogspot.com/-e72gCECKpNI/Tc4SNwBs6EI/AAAAAAAAACE/am_sbLGhyt4/s320/hist_days_gt_1.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Interesting, obviously an anomaly in the data? 
Or maybe we should now start to read about the problem we are trying to solve and how the data has been put together.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;We can just change the query to look at a different table in the database. The Claims data looks like this...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-BrSyZiVgy6o/Tc9XPk5AWVI/AAAAAAAAACQ/AhNbtC7m4Nc/s1600/histograms_claims.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="221" j8="true" src="http://2.bp.blogspot.com/-BrSyZiVgy6o/Tc9XPk5AWVI/AAAAAAAAACQ/AhNbtC7m4Nc/s320/histograms_claims.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;This gives us an idea of what we are dealing with.&amp;nbsp;Now to&amp;nbsp; '&lt;em&gt;&lt;em&gt;shake rattle and roll!'&lt;/em&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;em&gt;&amp;nbsp;&lt;/em&gt;&lt;/span&gt;&lt;/em&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-3816573175487585683?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/3816573175487585683/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/connecting-r-to-our-database.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3816573175487585683'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/3816573175487585683'/><link rel='alternate' type='text/html' 
href='http://anotherdataminingblog.blogspot.com/2011/05/connecting-r-to-our-database.html' title='Connecting R to our Database and Plotting Histograms'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-f1YlH8QeSFo/Tc28J49umAI/AAAAAAAAABw/KdqhG5NnOU8/s72-c/servername.PNG' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7540752686192448808</id><published>2011-05-11T06:13:00.000-07:00</published><updated>2011-05-11T06:13:29.113-07:00</updated><title type='text'>Relevant Adds are Here!</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;I posted a piece on loading in the data, and now I see ads appearing that are related to it. This is good. I even wanted to click on them myself to see what they were, but I resisted as I don't want to fund the kitty myself.&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-_ocLfdQ-XGw/TcqJ_TvVEoI/AAAAAAAAABo/sy7lES1cPmQ/s1600/morerelevantadds.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="287" src="http://3.bp.blogspot.com/-_ocLfdQ-XGw/TcqJ_TvVEoI/AAAAAAAAABo/sy7lES1cPmQ/s320/morerelevantadds.PNG" width="292" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;So click away, boys and girls, if you like the ads. 
I am $2 down today already as I went to the laundrette to dry my clothes (rained in Melbourne today) and put my $2 in the machine to get some change and all that came out was a flattened hearing aid battery.&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7540752686192448808?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7540752686192448808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/relevant-adds-are-here.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7540752686192448808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7540752686192448808'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/relevant-adds-are-here.html' title='Relevant Adds are Here!'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-_ocLfdQ-XGw/TcqJ_TvVEoI/AAAAAAAAABo/sy7lES1cPmQ/s72-c/morerelevantadds.PNG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7994036253147361690</id><published>2011-05-11T05:56:00.000-07:00</published><updated>2011-05-13T15:49:23.692-07:00</updated><title type='text'>Beware the HHP submission sort order</title><content type='html'>&lt;div dir="ltr" 
style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;When a solution is submitted for the HHP, the file has to contain the memberID and the predicted Days In Hospital for year 4.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://www.heritagehealthprize.com/c/hhp/forums/t/533/the-order-of-memberid-in-target"&gt;It was noticed early on by a competitor&lt;/a&gt; that the memberIDs in the example file provided were not in numerical order; in fact they were quite random. Kaggle confirmed that the submissions had to be in the same memberID order as the supplied example file.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;This is a concern now that the data is in a database. There is no real concept of row order in a database table: just because you enter data in a particular order, there is no guarantee it will come out in that order. If you need a specific order, then an &lt;b&gt;order by&lt;/b&gt; clause should be used in the SQL to specify it.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;So how can we get the specified order?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;We need to add a new column to the Target.csv file that contains the row order. A simple way to do this in Excel is just to add an extra column going from 1 to the number of cases, in ascending order.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Another way to create this extra column is to use R. When R reads data, it preserves the order in which the file is read. 
The code below will create a new file with an extra column.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 300px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: #dddddd; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;#########################################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# the customer ID is not in any particular order in&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# the submission file. This code creates a sort order&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# that we can use to ensure things are sorted correctly.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#########################################################&lt;/span&gt;&lt;br /&gt;setwd(&lt;span style="color: #2a00ff;"&gt;'E:&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;comps&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;hhp&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;raw data&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;HHP_release2'&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#read in the original&lt;/span&gt;&lt;br /&gt;Target &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;read&lt;/span&gt;.csv(&lt;span style="color: #2a00ff;"&gt;'Target.csv'&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#add an extra column called orderbyme&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#with the row number as the value&lt;/span&gt;&lt;br /&gt;Target$orderbyme &amp;lt;- 1:nrow(Target)&lt;br /&gt;&lt;br /&gt;&lt;span 
style="color: #3f7f59;"&gt;#take a look at the result&lt;/span&gt;&lt;br /&gt;head(Target)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#write out a new file&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;write&lt;/span&gt;.csv(Target, file=&lt;span style="color: #2a00ff;"&gt;'Target_with_order.csv'&lt;/span&gt;,row.names=FALSE,na = &lt;span style="color: #2a00ff;"&gt;""&lt;/span&gt;)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;(the syntax highlighting is not perfect for R, it is for Perl, but lets you see the comments easily)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Now if we load this into SQL Server with the rest of the data, we have a field we can always sort by.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-uBhzJH4RGJ4/TcqBcFifvjI/AAAAAAAAABc/Zkjh8TzfyDA/s1600/orderbyme.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="280" src="http://3.bp.blogspot.com/-uBhzJH4RGJ4/TcqBcFifvjI/AAAAAAAAABc/Zkjh8TzfyDA/s320/orderbyme.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;Once a submission file is generated, we can also double check it in R and plot the distribution to give it a sanity check.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: #dddddd; color: black;"&gt;&lt;span style="color: #3f7f59;"&gt;#########################################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: 
#3f7f59;"&gt;# this R code reads in a submission file, checks it and&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;# plots the distribution of predicted days in hospital&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#########################################################&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#change this to the name of the submission file to check&lt;/span&gt;&lt;br /&gt;file_name &amp;lt;- &lt;span style="color: #2a00ff;"&gt;'submission6.csv'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#the submission files and&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#Target.csv should be here&lt;/span&gt;&lt;br /&gt;setwd(&lt;span style="color: #2a00ff;"&gt;'E:&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;comps&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;hhp&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;\\&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;submissions'&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#read in the files&lt;/span&gt;&lt;br /&gt;sub_file &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;read&lt;/span&gt;.csv(file_name, header=TRUE)&lt;br /&gt;compare_file &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;read&lt;/span&gt;.csv(&lt;span style="color: #2a00ff;"&gt;'Target.csv'&lt;/span&gt;,header=TRUE)&lt;br /&gt; &lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#check the file has the correct number of rows and columns&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(nrow(sub_file) != 70942){cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***invalid number of rows***\n\n'&lt;/span&gt;)}&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(ncol(sub_file) != 3){cat(&lt;span style="color: 
#2a00ff;"&gt;'\n\n***invalid number of columns***\n\n'&lt;/span&gt;)}&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#check the example file&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(nrow(compare_file) != 70942){cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***Target.csv is corrupted***\n\n'&lt;/span&gt;)}&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(ncol(compare_file) != 3){cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***Target.csv is corrupted***\n\n'&lt;/span&gt;)}&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#check the ids match&lt;/span&gt;&lt;br /&gt;differences &amp;lt;- &lt;span style="color: #7f0055; font-weight: bold;"&gt;abs&lt;/span&gt;(sub_file[,1] - compare_file[,1])&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(max(differences) != 0){&lt;br /&gt;cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***IDs in wrong order***\n\n'&lt;/span&gt;)&lt;br /&gt;}&lt;span style="color: #7f0055; font-weight: bold;"&gt;else&lt;/span&gt;{&lt;br /&gt;cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***IDs are in the correct order***\n\n'&lt;/span&gt;)&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#check the range&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(max(sub_file[,3]) &amp;gt; 15){&lt;br /&gt;cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***you have predictions &amp;gt; 15 days!***\n\n'&lt;/span&gt;)}&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #7f0055; font-weight: bold;"&gt;if&lt;/span&gt;(min(sub_file[,3]) &amp;lt; 0){&lt;br /&gt;cat(&lt;span style="color: #2a00ff;"&gt;'\n\n***you have predictions &amp;lt; 0 days!***\n\n'&lt;/span&gt;)}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################################&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#plot the submission 
distribution&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#####################################&lt;/span&gt;&lt;br /&gt;num_bins &amp;lt;- 50&lt;br /&gt;theplot &amp;lt;- hist(sub_file[,3]&lt;br /&gt;    ,breaks=num_bins&lt;br /&gt;    ,xlim = c(0, max(sub_file[,3]) + 0.1)&lt;br /&gt;    ,main = paste(&lt;span style="color: #2a00ff;"&gt;"Distribution of "&lt;/span&gt;,file_name)&lt;br /&gt;    ,xlab = &lt;span style="color: #2a00ff;"&gt;"Predicted Days in Hospital"&lt;/span&gt;&lt;br /&gt;    )&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #3f7f59;"&gt;#the mean prediction&lt;/span&gt;&lt;br /&gt;mean_pred &amp;lt;- colMeans(sub_file[3])&lt;br /&gt;&lt;br /&gt;xpos = max(sub_file[,3]) * 0.5&lt;br /&gt;ypos = max(theplot$counts) * 0.75&lt;br /&gt;mytext = paste(&lt;span style="color: #2a00ff;"&gt;"Mean submitted DIH= "&lt;/span&gt;,&lt;span style="color: #7f0055; font-weight: bold;"&gt;sprintf&lt;/span&gt;(&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;%&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;.&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;4f&lt;/span&gt;&lt;span style="color: #2a00ff;"&gt;"&lt;/span&gt;, mean_pred))&lt;br /&gt;text(xpos,ypos,mytext,&lt;span style="color: #7f0055; font-weight: bold;"&gt;pos&lt;/span&gt;=4)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;and this is what I&amp;nbsp; get for one of my submissions to date...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-PJXT4jEFSTc/TcqEQRUa3eI/AAAAAAAAABg/kRa67853ejk/s1600/submissikondistribution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-PJXT4jEFSTc/TcqEQRUa3eI/AAAAAAAAABg/kRa67853ejk/s320/submissikondistribution.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span 
style="font-family: Arial, Helvetica, sans-serif;"&gt;and interestingly if you change num_bins to 5,000, you get...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-AZZM7zwD-h4/TcqE-TNFfqI/AAAAAAAAABk/KFh5RE-x40w/s1600/submissikondistribution5000.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-AZZM7zwD-h4/TcqE-TNFfqI/AAAAAAAAABk/KFh5RE-x40w/s320/submissikondistribution5000.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;which will lead to some investigation why we get this funny distribution.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7994036253147361690?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7994036253147361690/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/beware-hhp-submission-sort-order.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7994036253147361690'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7994036253147361690'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/beware-hhp-submission-sort-order.html' title='Beware the HHP submission sort order'/><author><name>Sali 
Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-uBhzJH4RGJ4/TcqBcFifvjI/AAAAAAAAABc/Zkjh8TzfyDA/s72-c/orderbyme.PNG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-7386572215821233159</id><published>2011-05-10T15:08:00.000-07:00</published><updated>2011-05-10T19:56:33.061-07:00</updated><title type='text'>Some are saying this guy will win the #drflix prize</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;I suddenly started receiving emails saying I had a&amp;nbsp;first visitor from Belgium, UK etc. What was all this about?&amp;nbsp;I'd only told 2 people about this blog and was not ready to go public just yet.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;A bit of data mining via the blog web stats pointed to Twitter, and then I saw this on the HHP page next to the leaderboard. 
See the top post on the right hand side - I clicked on the link and was taken aback when it led to this blog.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I'm not into Twitter, so can someone explain what #drflix is?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-45HJWNIUWGw/Tcm0AU-B4CI/AAAAAAAAABI/y3yNbqH1KZw/s1600/twitter.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="208" j8="true" src="http://4.bp.blogspot.com/-45HJWNIUWGw/Tcm0AU-B4CI/AAAAAAAAABI/y3yNbqH1KZw/s320/twitter.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&lt;span style="font-family: Arial;"&gt;The number of visitors is now&amp;nbsp;85 from 16 countries&amp;nbsp;(about the same as the number of teams now entered) but the kitty is still empty, so I guess I need to start talking about data mining more to get some relevant ads. I'm also getting some error messages because of that web counter gadget, and the page is loading slowly because of video-type things - it may have to go, maybe when I get past 100 visitors.&lt;/span&gt;&lt;/div&gt;&lt;div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;As to who will win, Dave is looking pretty strong at the moment. He remains at the top - I knew I shouldn't have&lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html"&gt; bragged to him about my early triumph&lt;/a&gt;. Meanwhile I am sitting at no. 
23, tucked in the middle of the peloton.&amp;nbsp;&amp;nbsp;Hope to tell you how I got there soon.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-m5-csemjMAY/Tcm2qkxkquI/AAAAAAAAABM/dm2lVrWnGyo/s1600/ledavestillleading.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="144" j8="true" src="http://4.bp.blogspot.com/-m5-csemjMAY/Tcm2qkxkquI/AAAAAAAAABM/dm2lVrWnGyo/s320/ledavestillleading.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-7386572215821233159?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/7386572215821233159/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/some-are-saying-this-guy-will-win.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7386572215821233159'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/7386572215821233159'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/some-are-saying-this-guy-will-win.html' title='Some are saying this guy will win the #drflix prize'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/-45HJWNIUWGw/Tcm0AU-B4CI/AAAAAAAAABI/y3yNbqH1KZw/s72-c/twitter.PNG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-1694818788659984506</id><published>2011-05-10T05:45:00.000-07:00</published><updated>2011-05-10T14:41:09.500-07:00</updated><title type='text'>Damn Computers</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;MICROSOFT&lt;/span&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Been having a few issues with my Windows 7 64 bit laptop. Basically the other day it decided to tell me not to switch it off as it was going to install 125 updates. After a few hours waiting everything finished, but some very useful software I had written&amp;nbsp;using Microsoft's very own compiler no longer worked. It did work once but it looked as if the font was Chinese, and in subsequent&amp;nbsp;attempts it seemed to want to run as a service. This was odd as the very same executable worked fine on my 32 bit machine.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;To try to solve this issue I recompiled&amp;nbsp;on the 64 bit machine and everything seemed OK.&amp;nbsp;To my delight&amp;nbsp;it seems now&amp;nbsp;that a single executable is all that is needed to&amp;nbsp;run on 32 and 64 bit machines, which is good. The new issue was that&amp;nbsp;I compiled with a setting that says you need dot net 4, and my 32 bit machine didn't have that. 
Rather than make people have to install dot net 4, I tried to recompile with a setting that says you only need dot net 2, but&amp;nbsp;when I changed this setting the compiler got stuck in some kind of never-ending loop and didn't do anything.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The reason for&amp;nbsp;saying this is that I&amp;nbsp;am going to mention this software in future posts - but&amp;nbsp;if you want to use it&amp;nbsp;Microsoft decrees you will need to install&amp;nbsp;the latest version of dot net, unless anyone can tell me how to solve this problem.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;strong&gt;BLOGGER &amp;amp; FIREFOX&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Another issue that I've had is with this blog. I tried to edit posts and again, the little circle whizzed around indicating that it was thinking, but an hour later it was still thinking. I also couldn't seem to get any love from the template settings options.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I was using Firefox, which&amp;nbsp;had also recently updated itself,&amp;nbsp;so thought I'd try to see if it was the browser. Being as the blog was a Google thing I thought I'd download Chrome.&amp;nbsp;Now when I tried to go to the Chrome download page, I got the&amp;nbsp;message from Google saying wait for the download&amp;nbsp; and the same whizzing circles, which also wanted to be there for hours. I suspect this is a Firefox anti Google thing. 
I then tried IE and everything worked fine in that, so Microsoft have redeemed themselves a little.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;strong&gt;RATTLE&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;The next issue I had was with trying to get &lt;a href="http://rattle.togaware.com/"&gt;Rattle&lt;/a&gt; to run in 64 bit mode. There are a few documented things you need to check; I tried them all but couldn't get it to work. I have it working in 32 bit R, but 64 bit would be nice. Rattle is a nice useful bit of data mining software that I will hopefully be mentioning at some point in this blog.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;strong&gt;PGP Encryption&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I have recently had some &lt;a href="http://en.wikipedia.org/wiki/Pretty_Good_Privacy"&gt;Pretty Good Privacy&lt;/a&gt; put on my hard drive, which&amp;nbsp;at first glance seems a pretty good idea if your computer contains valuable information.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;Now for me, the valuable information is large datasets that can contain sensitive details you don't want&amp;nbsp;to find&amp;nbsp;their way into the newspaper&amp;nbsp;should you leave your laptop&amp;nbsp;on the train. 
The software I mentioned earlier reads these data files, but what I am finding is that it will read the first 30,000 or so rows in a few seconds, but then grind to a really slow pace after that, which is bad when you have a few million rows.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;To rule out the possibility that it was a result of the new compilation of the software, I moved one of the datasets to a different drive that was not encrypted, and voila&amp;nbsp;- it worked as expected. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;&lt;strong&gt;Java &amp;amp; Eclipse&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial;"&gt;I was using &lt;a href="http://www.spss.com/"&gt;SPSS&lt;/a&gt; (a tool that can do things with data)&amp;nbsp;and was quite merrily&amp;nbsp;putting together a model when&amp;nbsp;the window quite rudely&amp;nbsp;decided&amp;nbsp;it was not playing.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-cd7obyWN_u8/TcmtCFsFfvI/AAAAAAAAABE/58Xv5WKvnXA/s1600/blankingsmall.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="236" j8="true" src="http://3.bp.blogspot.com/-cd7obyWN_u8/TcmtCFsFfvI/AAAAAAAAABE/58Xv5WKvnXA/s320/blankingsmall.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Initially I&amp;nbsp;suspected it was an issue with SPSS rather than something I had corrupted on my PC,&amp;nbsp;especially when a colleague who had exactly the same PC and OS got the same problem. 
This was an issue for me, particularly as the blackout was unannounced and made you lose any unsaved work.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;SPSS was redeemed when I then discovered it happened in another application, unfortunately for me the one that helped me get my expenses reimbursed. The common denominator was that the applications&amp;nbsp;were both built on Eclipse.&amp;nbsp;&amp;nbsp;All evidence pointed to the video drivers and Java not being a happy couple.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial;"&gt;I found there were some new drivers available for download and installed them. Hopefully this has fixed the issue, but as&amp;nbsp;this issue happens&amp;nbsp;unannounced&amp;nbsp;half way through some work, I have learnt&amp;nbsp;to save things quite frequently, just in case it happens again.&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-1694818788659984506?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/1694818788659984506/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/1694818788659984506'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/1694818788659984506'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html' title='Damn Computers'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-cd7obyWN_u8/TcmtCFsFfvI/AAAAAAAAABE/58Xv5WKvnXA/s72-c/blankingsmall.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-6605498294152192587</id><published>2011-05-07T14:09:00.000-07:00</published><updated>2011-05-16T15:02:38.218-07:00</updated><title type='text'>Progress - Loading the HHP Data</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;A little progress...&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;ol style="font-family: Arial, Helvetica, sans-serif; text-align: left;"&gt;&lt;li&gt;My daughter made it through her first swimming lesson without needing a swimming nappy. This was mainly my fault as we had run out and the lesson was too early to have time to go to the shop. 
Anyway, that should save me some cash, so maybe I will no longer need the full 3 million.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I managed to find a way of posting code with syntax highlighting into this blog&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I managed to load the HHP data into a database&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I picked up some bugs and improved some of my software while doing 3).&lt;/li&gt;&lt;/ol&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;So, now the details....&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;1) Well done Princess no. 3&lt;/b&gt;&lt;br /&gt;(but please tell us where you hid the TV remote)&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;b&gt;2) Highlighting Code in a blog&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;When you do data mining, you will generally at some point have to write some computer code that does stuff to your data. You will notice that when you write the code it generally turns nice colours to highlight things that make the code easier to interpret.&lt;br /&gt;&lt;br /&gt;I will hopefully be doing lots of this and posting code snippets to this blog so others can get ideas or use the exact same code to try out&amp;nbsp; (not sure how this fits in with the rules of the contest, but let's go for it anyway). It would be nice if the code also coloured itself in the blog.&lt;br /&gt;&lt;br /&gt;A Google search came up with this site &lt;a href="http://tohtml.com/" target="_new"&gt;http://tohtml.com/&lt;/a&gt;. 
You just paste your code in a window, select what computer language the code is written in, press a button and presto, it generates html code for you. You then just edit the html of your blog and paste it in, and you should get something like this...&lt;/div&gt;&lt;br /&gt;&lt;pre style="background: #f6f8ff; color: #000020;"&gt;Sub IsThisRight&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;    MsgBox &lt;span style="color: #308080;"&gt;(&lt;/span&gt;IsNumeric&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: maroon;"&gt;"&lt;/span&gt;&lt;span style="color: #1060b6;"&gt;50+&lt;/span&gt;&lt;span style="color: maroon;"&gt;"&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;End Sub&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;and if you tweak the background colour by editing the html...&lt;/div&gt;&lt;br /&gt;&lt;pre style="background: #cccccc; color: #000020;"&gt;Sub IsThisRight&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;    MsgBox &lt;span style="color: #308080;"&gt;(&lt;/span&gt;IsNumeric&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: maroon;"&gt;"&lt;/span&gt;&lt;span style="color: #1060b6;"&gt;50+&lt;/span&gt;&lt;span style="color: maroon;"&gt;"&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;End Sub&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;and if the code is big you can put it in a window by wrapping a div tag around it. 
&lt;/div&gt;&lt;br /&gt;&lt;textarea cols="80" row="20" style="height: 49px; width: 442px;"&gt; &amp;lt;div style="height: 100px; overflow: auto; width: 100px;"&amp;gt;code in here &amp;lt;/div&amp;gt; &lt;/textarea&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="height: 100px; overflow: auto; width: 100px;"&gt;&lt;pre style="background: #cccccc; color: #000020;"&gt;Sub IsThisRight&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;    MsgBox &lt;span style="color: #308080;"&gt;(&lt;/span&gt;IsNumeric&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: maroon;"&gt;"&lt;/span&gt;&lt;span style="color: #1060b6;"&gt;50+&lt;/span&gt;&lt;span style="color: maroon;"&gt;"&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;End Sub&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Now this is probably not the most elegant solution, and the background colour of the code is lost outside the code window, though I'm sure that can be easily fixed. It works for me at the moment, but I'm open to other suggestions.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;One gotcha is that when you edit the html in Blogger it is too smart and wants to add extra tags and rearrange things for you. When I pasted the div code above, height and width were next to each other, but Blogger decided overflow and auto should go in between them.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;A call for help - how do I set the default font in Blogger? 
I can't find the option anywhere.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;3) Loading the HHP data into a database&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Well, at last something related to data mining...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;When you are doing analytics as a consultant, you typically get given data files in one of 3 ways.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;an Excel spreadsheet&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;some other proprietary data file format such as SAS or SPSS&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;a text file that is an extract from a database&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;The Excel spreadsheet can turn up for one of 2 reasons:&lt;/span&gt;&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Excel is the database used and where the raw data lives (worrying but very common)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The person doing the extract extracted it from a database for you and then decided they were doing you a favour by putting it in Excel to pretty it up (also worrying)&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;I've had the following explanations given to me...&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;"There were a few million rows in the database but I can only get 64,000 on each sheet of excel so it will take me a while to get it ready for you."&lt;/span&gt;&lt;/i&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;i&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;"I extracted it but had to put it in excel because I knew some of the data was wrong so I manually edited it for you."&lt;/span&gt;&lt;/i&gt;&lt;/blockquote&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;When you hear this, just take a deep breath and do a bit of hand holding.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;You often get proprietary files sent, which basically means it can be a pain to open them if you don't have the proprietary software. These are generally binary files and are the ones that &lt;/span&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;look like hieroglyphics when you try to open them in a text editor. Binary files are a more efficient format for storing and retrieving data. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The bog standard &lt;a href="http://en.wikipedia.org/wiki/ASCII"&gt;&lt;i&gt;ascii&lt;/i&gt;&lt;/a&gt; text file extract is the most painless way to go, which is the way it works in Kaggle competitions. 
There are two ways you could get these files:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;fixed width&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;character separated variables (*.csv files)&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The data represents fields/variables/columns/features and rows/cases/examples. Text files are clever and know when there is a new row of data, but specifying which bit of data should be in which column is another matter.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;In fixed width files, each column starts at the same place in each row. This is not necessarily best as you may have to do a lot of padding with blank spaces to get the columns aligned, which makes the file bigger. If all the data in each column has the same width, then you will get a smaller file as it saves having to put a delimiter character in there. They are also not particularly simple to load - these are the ones in&amp;nbsp;Access where&amp;nbsp;you have to click to tell&amp;nbsp;Access where the columns need to go. 
The field names are also generally not on the first row; they are tucked away in some other field definition file.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;a href="http://en.wikipedia.org/wiki/Delimiter-separated_values"&gt;CSV&lt;/a&gt; (C Separated Values) files put a particular character in each line to indicate where the column boundaries are.&amp;nbsp; &lt;a href="http://tools.ietf.org/html/rfc4180"&gt;There are a few rules on how the files should be generated&lt;/a&gt; (note: there used to be a Wikipedia entry that said csv stood for character, not comma, but I can't find this anymore - but there is a new entry&lt;a href="http://en.wikipedia.org/wiki/Delimiter-separated_values"&gt; DSV&lt;/a&gt;.)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;The most common character is a comma (","), but using this is fraught with danger which can lead to your file not loading properly.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;If you have a text field in your data where a comma is actually contained in the text, such as an address ("1 Smith St, Melbourne"),&amp;nbsp; the text field needs to be &lt;i&gt;"qu,oted" &lt;/i&gt;to show that the comma is part of the data rather than a delimiter. If these quotes are not included your data will get stuffed up and will have to be extracted again. 
This happens&lt;b&gt;&lt;i&gt; very often&lt;/i&gt;&lt;/b&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Also, some countries use the "," as a decimal point, which can also complicate things (see &lt;a href="http://en.wikipedia.org/wiki/Comma-separated_values"&gt;here&lt;/a&gt;).&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;When receiving data, I always ask for pipe delimited (this is the pipe character&amp;nbsp; | ) files with no quotes around string fields. The pipe character is unlikely to be present in any text fields (as not many people actually know what it is) so is a fair choice for a delimiter. Not including quotes around text, although this is defined in the standard, is going to be more friendly to most software you will use. &lt;a href="http://en.wikipedia.org/wiki/Tab-separated_values"&gt;Tab delimited&lt;/a&gt; files are another option.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;Well, enough of the side track, back to the HHP data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;When you receive a delimited text file, I recommend you check it before you do anything - don't assume it has been extracted correctly. The check is simply confirming that each row has the same number of delimiters.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Now you could load it into Excel to eyeball and it might 'look right', but Excel won't deal with bad files, and your columns may become misaligned. 
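&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;As a rough illustration of the delimiter check described above, here is a minimal Python sketch. It is only a sketch and not any particular tool: the function name find_bad_rows and the sample rows are made up for the example, and it assumes a pipe delimited file with no quoted fields, where the first line (usually the header) sets the expected number of delimiters.&lt;/span&gt;&lt;br /&gt;
&lt;pre style="background: #f6f8ff; color: #000020;"&gt;
```python
def find_bad_rows(lines, delimiter="|"):
    """Return 1-based numbers of lines whose delimiter count differs from the first line."""
    lines = list(lines)
    if not lines:
        return []
    # The first line (usually the header) sets the expected number of delimiters.
    expected = lines[0].count(delimiter)
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(delimiter) != expected
    ]


# Hypothetical sample: the third line has one delimiter too many.
sample = ["a|b|c", "1|2|3", "1|2|3|4", "5|6|7"]
print(find_bad_rows(sample))  # prints [3]
```
&lt;/pre&gt;
&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;In practice you could pass an open file handle instead of a list, since a file object also yields one line at a time.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;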
I also don't trust Excel as it seems to have a mind of its own sometimes and does things to data that you would rather it didn't, and that you might not even realise it has done (try typing (22) into Excel, including the brackets. Dates are also a concern for me).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;You can also load the data into numerous databases using the import wizards they provide. Now some will tell you there are issues, but not all.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Microsoft SQL Server is one example of a database that &lt;i&gt;used&lt;/i&gt; to be good at checking things. I often got messages such as 'error at row 2,758,946' and would then go and check the file at that row and discover there was an issue. You would then have to correct that row, re-load, get the next error message, correct, re-load etc.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Now, in order to show you what the error message was, I tried running an example in Microsoft SQL Server 2008 R2. 
I used the data import wizard to load the following CSV file:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-2UVgsMwv5L4/TcoVsLLYYOI/AAAAAAAAABQ/syE6KIa2khg/s1600/baddelimiter.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-2UVgsMwv5L4/TcoVsLLYYOI/AAAAAAAAABQ/syE6KIa2khg/s1600/baddelimiter.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;and this is what it did:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-NdXAFN5le1A/TcoYcPVAUXI/AAAAAAAAABU/PNI6mALcpSg/s1600/BADDELIMITERSQL.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-NdXAFN5le1A/TcoYcPVAUXI/AAAAAAAAABU/PNI6mALcpSg/s1600/BADDELIMITERSQL.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;This is not what I was expecting. You will see column c, row 2 has the value 3,4. 
This is probably not what was intended.&amp;nbsp; Now the last time I used the import wizard, in version 2000, I'm sure it would have thrown an error, so beware of version changes.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Anyway, data lives in a database, and there are many free ones available. Most of the major vendors offer free versions that have limitations on the size of the database, and there are also open source ones, &lt;a href="http://www.freebyte.com/programming/database/"&gt;see this list&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;I am going to use &lt;a href="http://www.microsoft.com/express/Database/"&gt;MS SQL Server 2008 R2&lt;/a&gt; (the Express edition), which is free. It has a 10 gig limit per database, but you can create multiple databases and communicate between them, so unless you have a really massive file it should accommodate most needs. Installation, from what I remember, had a few tricky questions that will probably make you think '&lt;i&gt;what the **** is that on about&lt;/i&gt;' if you are not used to database lingo, but once installed it is a pretty user friendly tool.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;In order to load the data, you can use the data import wizard, but this is not something I personally like. I like to run scripts, which are sets of instructions from a file, so you can readily reproduce your steps by running the script, without having to remember any manual things you did. Scripts also make your work completely portable and shareable with others so you can work in teams. 
As long as you have access to the raw data, you can email your bunch of scripts to your web-mail account, fly to the other side of the world and reproduce the exact steps without having to take your computer with you. This is also why it is nice to use software that is free and downloadable from the internet.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;As I needed to frequently check files and automatically generate scripts to load the data, I wrote a tool (niftytools.exe, Windows only) that has been a massive time saver. This checks the file, corrects dodgy rows or removes them, and generates the script to load the data into SQL Server.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Here it is in use on the claims data: &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-DVNOZ8Fw5IQ/TcoiyRoJU7I/AAAAAAAAABY/xJgPoeU0JdE/s1600/niftytools.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="400" src="http://4.bp.blogspot.com/-DVNOZ8Fw5IQ/TcoiyRoJU7I/AAAAAAAAABY/xJgPoeU0JdE/s400/niftytools.PNG" width="395" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;and this is the script it generated for loading the data,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"&gt;&lt;br /&gt;&lt;pre style="background: 
rgb(246,248,255); color: #000020;"&gt;&lt;span style="color: #200080; font-weight: bold;"&gt;CREATE DATABASE&lt;/span&gt; HHP&lt;br /&gt;&lt;span style="color: #200080; font-weight: bold;"&gt;USE&lt;/span&gt; HHP&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #200080; font-weight: bold;"&gt;CREATE&lt;/span&gt; &lt;span style="color: #200080; font-weight: bold;"&gt;TABLE&lt;/span&gt; Claims&lt;br /&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;br /&gt;    MemberID    &lt;span style="color: #200080; font-weight: bold;"&gt;int&lt;/span&gt;  &lt;span style="color: #595979;"&gt;-- varchar(8)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    ProviderID    &lt;span style="color: #200080; font-weight: bold;"&gt;int&lt;/span&gt;  &lt;span style="color: #595979;"&gt;-- varchar(7)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    Vendor    &lt;span style="color: #200080; font-weight: bold;"&gt;int&lt;/span&gt;  &lt;span style="color: #595979;"&gt;-- varchar(6)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    PCP    &lt;span style="color: #200080; font-weight: bold;"&gt;int&lt;/span&gt;  &lt;span style="color: #595979;"&gt;-- varchar(5)  --integers starting with 0, could be text!&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    &lt;span style="color: #200080; font-weight: bold;"&gt;Year&lt;/span&gt;    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;2&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    Specialty    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: 
#308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;25&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    PlaceSvc    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;19&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    PayDelay    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;4&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    LengthOfStay    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;10&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    DSFS    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;12&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    PrimaryConditionGroup    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;8&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    CharlsonIndex    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;3&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span 
style="color: #308080;"&gt;,&lt;/span&gt;    ProcedureGroup    &lt;span style="color: #200080; font-weight: bold;"&gt;varchar&lt;/span&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;span style="color: #008c00;"&gt;4&lt;/span&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;    SupLOS    tinyint&lt;br /&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: #200080; font-weight: bold;"&gt;BULK&lt;/span&gt; &lt;span style="color: #200080; font-weight: bold;"&gt;INSERT&lt;/span&gt; Claims&lt;br /&gt;&lt;span style="color: #200080; font-weight: bold;"&gt;FROM&lt;/span&gt; &lt;span style="color: #1060b6;"&gt;'E:\comps\hhp\raw data\HHP_release2\Claims.csv'&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #200080; font-weight: bold;"&gt;WITH&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;(&lt;/span&gt;&lt;br /&gt;MAXERRORS &lt;span style="color: #308080;"&gt;=&lt;/span&gt; &lt;span style="color: #008c00;"&gt;0&lt;/span&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;&lt;br /&gt;FIRSTROW &lt;span style="color: #308080;"&gt;=&lt;/span&gt; &lt;span style="color: #008c00;"&gt;2&lt;/span&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;&lt;br /&gt;FIELDTERMINATOR &lt;span style="color: #308080;"&gt;=&lt;/span&gt; &lt;span style="color: #1060b6;"&gt;','&lt;/span&gt;&lt;span style="color: #308080;"&gt;,&lt;/span&gt;&lt;br /&gt;ROWTERMINATOR &lt;span style="color: #308080;"&gt;=&lt;/span&gt; &lt;span style="color: #1060b6;"&gt;'\n'&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #308080;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Note it has determined the maximum size the varchar (variable length character) fields need to be for optimal size, which makes the database as small as possible. 
It also determines what type of integer is most efficient. You will see that the &lt;i&gt;tinyint &lt;/i&gt;is not coloured, which is presumably because the web page mentioned earlier that did this syntax highlighting hasn't got this datatype in its list of special words. I emailed the developer, so we shall see if it changes. Also the field name Year is highlighted because it is also the name of a built-in function in SQL Server. When this happens it can sometimes be worthwhile renaming the field (just change the word Year to something else).&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;One other thing worth noting: there are comments starting with&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: #cccccc; color: #38761d;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;-- integers starting with 0&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&amp;nbsp;It was noted on the HHP forum that the IDs in the submission entry example did not match the IDs in the data. This is because they started with zeros, so whatever software was used to generate the submission example had treated them as integers and lost the leading zero. 
Thanks to this I was able to modify my nifty tool to generate warnings if this is the case, and it gives you an alternative if you want to treat the field as a character field, which will preserve the zero.&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Leading zeros are sometimes important, as 003 and 03 are not the same, but would become the same if you treated them as integers. I learned this when I used to deal with address files that had postcodes beginning with 0.&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: white;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Another 'bug' I discovered in my nifty tool was that "55+" passes the isNumeric test (try it in an Excel macro - see the sample code at the top of this post), even though it is not really a number and SQL Server won't accept it as one. This improved the tool again, and highlighted the fact that there was some data like this that would need to be closely examined later on. 
&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif;"&gt;Running the generated script will load the data, and it can then be viewed (click on the image below to see how).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Z8RXO57JaK0/TcZ0KlQ-lMI/AAAAAAAAAA0/P023rcNaqjQ/s1600/selecttop10.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="152" src="http://3.bp.blogspot.com/-Z8RXO57JaK0/TcZ0KlQ-lMI/AAAAAAAAAA0/P023rcNaqjQ/s400/selecttop10.PNG" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;So we are on our way! All the data is in and we can take a peek at it to see what it looks like.&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;If you want to use niftytools to check and help load data then it is freely available. You will first need to download and install my &lt;a href="http://www.tiberius.biz/"&gt;Tiberius package&lt;/a&gt;, and then look in the program files folder to find it. 
Note you will need .NET 4 - see &lt;a href="http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html"&gt;this blog post&lt;/a&gt; to find out why.&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial, Helvetica, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-6605498294152192587?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/6605498294152192587/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6605498294152192587'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/6605498294152192587'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html' title='Progress - Loading the HHP Data'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-2UVgsMwv5L4/TcoVsLLYYOI/AAAAAAAAABQ/syE6KIa2khg/s72-c/baddelimiter.PNG' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-2570930724182624821</id><published>2011-05-06T14:32:00.000-07:00</published><updated>2011-05-07T03:50:40.546-07:00</updated><title type='text'>Who needs the HERITAGE HEALTH PRIZE anyway?</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-WAgzSs1YFNA/TcRthTnd0LI/AAAAAAAAAAU/sEszfGSVL60/s1600/counter.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Just been playing around with this blogging software and discovered that it wants to pay me. Not sure how this works (I presume writing HERITAGE HEALTH PRIZE everywhere will help?) and whether posting this post is breaking the agreement I just agreed to (something to do with not tricking people into clicking on the ads). Anyway, let's see if I can top $3 million in 2 years! 
Click away guys and girls.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-gb9I76WjJ3U/TcRm0lCBqfI/AAAAAAAAAAQ/-ha9KQ_J9_k/s1600/earnings.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-gb9I76WjJ3U/TcRm0lCBqfI/AAAAAAAAAAQ/-ha9KQ_J9_k/s1600/earnings.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;HERITAGE HEALTH PRIZE&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;DATA MINING&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;KAGGLE&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;HERITAGE HEALTH PRIZE ;-)&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;The number of visitors will obviously have to increase...&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-WAgzSs1YFNA/TcRthTnd0LI/AAAAAAAAAAU/sEszfGSVL60/s1600/counter.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-WAgzSs1YFNA/TcRthTnd0LI/AAAAAAAAAAU/sEszfGSVL60/s1600/counter.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt; &lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;b&gt;UPDATE&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;I had a 
visitor (the one person I told),&amp;nbsp; as informed by the email below, which came from the web counter you will see on the top right (which I discovered by clicking the 'next blog' button and saw on another blog).&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-31aDepTFbOg/TcS2lqG7tNI/AAAAAAAAAAY/UyUI_ue4-HE/s1600/firstvisitor.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-31aDepTFbOg/TcS2lqG7tNI/AAAAAAAAAAY/UyUI_ue4-HE/s1600/firstvisitor.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;span style="font-size: small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt; &lt;/b&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;Note that in the email there is a targeted ad. Very clever - the person is from Australia and they want me to visit them!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;Google just informed me my ads are up, with an email that says,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-3Ms6Uuxg3n8/TcS37OWi2PI/AAAAAAAAAAc/blhiKCHbWlg/s1600/addsense.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-3Ms6Uuxg3n8/TcS37OWi2PI/AAAAAAAAAAc/blhiKCHbWlg/s1600/addsense.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;I guess Google has some smart algorithm that determines 
relevance based on what I am gibbering on about. I think I will have to start talking about Data Mining and hopefully we will see some change. The first ad I got was a bit worrying...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-4OySoGISJ4o/TcS4ohRHbcI/AAAAAAAAAAg/6G_EslWcfm8/s1600/add.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-4OySoGISJ4o/TcS4ohRHbcI/AAAAAAAAAAg/6G_EslWcfm8/s1600/add.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;b&gt;UPDATE 2&lt;/b&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;While I type, another visitor. Hi Dave! And another email. Gee, their algorithm must have cost millions to develop. 
Can't wait for my first 'visit the South Pole' email.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-gTTOQ1mCFok/TcS6sT6gtDI/AAAAAAAAAAk/yoC1d2siRB4/s1600/secondvisitor.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-gTTOQ1mCFok/TcS6sT6gtDI/AAAAAAAAAAk/yoC1d2siRB4/s1600/secondvisitor.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;b&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;b&gt;&lt;span style="font-size: small;"&gt; &lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;b&gt;UPDATE 3 - relevant ads?&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;This ad is probably getting a bit more relevant: it's a competition and only 3 zeros are missing, otherwise spot on! 
&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-QVe8tQ11eP4/TcUbEabj7WI/AAAAAAAAAAo/ZgzisyZF7DM/s1600/competitionadd.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-QVe8tQ11eP4/TcUbEabj7WI/AAAAAAAAAAo/ZgzisyZF7DM/s1600/competitionadd.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;And this double ad has picked up where 2 of my 3 visitors have come from (although more likely it knows where I am). I've no idea what the first is advertising, but the second has picked up on the word heritage. Pretty impressive so far - but still the kitty is empty!&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-4t14nIifSoI/TcUdlEvobUI/AAAAAAAAAAw/P4K_AOuBal0/s1600/MelbourneAdd2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-4t14nIifSoI/TcUdlEvobUI/AAAAAAAAAAw/P4K_AOuBal0/s1600/MelbourneAdd2.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: small;"&gt; &lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-2570930724182624821?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://anotherdataminingblog.blogspot.com/feeds/2570930724182624821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/who-needs-heritage-health-prize-anyway.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/2570930724182624821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/2570930724182624821'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/who-needs-heritage-health-prize-anyway.html' title='Who needs the HERITAGE HEALTH PRIZE anyway?'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-gb9I76WjJ3U/TcRm0lCBqfI/AAAAAAAAAAQ/-ha9KQ_J9_k/s72-c/earnings.PNG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8236580214478682937.post-712303129003531621</id><published>2011-05-06T14:06:00.000-07:00</published><updated>2011-05-06T14:06:22.273-07:00</updated><title type='text'>The day I led the HHP (briefly)</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;The $3 million HHP competition started on &lt;a href="http://www.kaggle.com/Competitions"&gt;Kaggle&lt;/a&gt; the other day. I was briefly leading (hope the leaderboard looks similar in 2 years' time). 
Ecstatic in my efforts, I checked out who was online to brag to and discovered Dave.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;i&gt;'Hey Dave - winning $3 million and haven't even downloaded the data yet!'&lt;/i&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;Lesson 1: Dave is smarter than me!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://2.bp.blogspot.com/-EkKQLf3tAS0/TcRfHXHviZI/AAAAAAAAAAM/poGPRrOmgYo/s1600/leaderboard.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-EkKQLf3tAS0/TcRfHXHviZI/AAAAAAAAAAM/poGPRrOmgYo/s1600/leaderboard.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;i&gt; &lt;/i&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt;&lt;i&gt; &lt;/i&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: Arial,Helvetica,sans-serif;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;/pre&gt;&lt;pre&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8236580214478682937-712303129003531621?l=anotherdataminingblog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://anotherdataminingblog.blogspot.com/feeds/712303129003531621/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html#comment-form' title='0 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/712303129003531621'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8236580214478682937/posts/default/712303129003531621'/><link rel='alternate' type='text/html' href='http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html' title='The day I led the HHP (briefly)'/><author><name>Sali Mali</name><uri>http://www.blogger.com/profile/14741877332990090234</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-EkKQLf3tAS0/TcRfHXHviZI/AAAAAAAAAAM/poGPRrOmgYo/s72-c/leaderboard.PNG' height='72' width='72'/><thr:total>0</thr:total></entry></feed>
