Another Data Mining Blog: Minimise the what?

The other week I was on Amazon to purchase Data Mining with R, and Amazon did some cross-selling, telling me I might be interested in ggplot2. Now I remember listening to a talk by Jeremy Howard, and seemed to recall he mentioned this book was worth a read, so I ticked the approproate box, and now the books are here.

I prefer looking at pictures to interpret information, so I wanted to see if I could use my new book to figure out what this log function was all about that the HHP prize people want us to minimize.

The following plots are all variations of the same data, plotted with different packages and functions. The generated data is basically what the errors would be for all possible combinations of Actual Days In Hospital v Predicted Days in Hospital.

Now generally, if you predict 15 and the real answer is 14, that is considered just as good as if you predict 2 and the answer is 1 - both have the same error - 1. Many methods of determining what is a good prediction, such as the mean squared error (MSE), don't care about the magnitude of the prediction, just the magnitude of the error.

Other metrics such as mean absolute percentage error (MAPE) do consider the magnitude. In electric load forecasting, if the actual load is 2MegaWatts (MW) and you have forecast 1MW, it can be far more serious than if the actual load is 900MW and you have forecast 899MW. In both cases the error is only 1MW but the consequences can be quite different.

So what does the error function of the HHP tell us?

Basically it says that we are penalised less for the same absolute error the higher the actual Days in Hospital is. So rather than concentrate on improving the prediction for someone who was 15 days in hospital and we only predicted 14 days, we would get far more reward by putting the effort into improving the prediction for someone who was 2 days in hosptal and we only predicted 1 day.

or...

a small improvement in accuracy for a small Days in Hospital is worth as much as a large improvement in accuracy for a large Days in Hospital.

Now all this insight was gleaned from just creating a graph. No Maths!

Now if we zoom in to what is the business end (remember 0.2 is a good guess) we see things are a bit more linear...

And here is what a we get for a normal type of error (just strip out the log and +1 from the data generation function). Note the lines are parallel.

and for percentage errors, note the lines are converging, similar to the log error, but the worse position to be in is making high predictions when the actual is low.

Here is the R code to generate the plots...

#############################################
# generate the data
#############################################

dih <- seq(from=0, to=15, by = 0.2)
dat <- expand.grid(act = dih, pred = dih)
dat$err <- sqrt((log(dat$pred+1) - log(dat$act+1)) ^ 2)
dat1 <- matrix(sqrt((log(dat$act+1) - log(dat$pred+1)) ^ 2),length(dih))



###########################################
# now plot the error surface of act v pred
# using several methods
###########################################
library(ggplot2)
library(lattice)
library(vcd)
require(grDevices) # for colours


### using ggplot2 ###
df <- data.frame(dat)
names(df) <- c("actual","predicted","error")

#version1
errplot <- ggplot(df, aes(actual,predicted, fill=error))
errplot <- errplot + geom_tile()
errplot <- errplot + scale_x_continuous(expand = c(0,0)) 
errplot <- errplot + scale_y_continuous(expand = c(0,0))
errplot

#version2
fill_gradn <- function(pal) {
  scale_fill_gradientn(colours = pal(7),limits = c(0,.5))
}
errplot <- errplot + fill_gradn(rainbow_hcl)
errplot


### using wireframe ###
wireframe(err ~ act * pred, data = dat
          ,scales = list(arrows = FALSE)
          ,drape = TRUE
          ,colorkey = TRUE,
          screen = list(z = 30, x = -60))
    
contourplot(err ~ act * pred, data = dat
          ,region = TRUE
          ,cuts = 10
          ,col.regions = terrain.colors
          )
    
    
## filled contour plot
filled.contour(x = dih, y = dih, z = dat1
                ,nlevels=10
                ,color = terrain.colors
                ,plot.title = title(main = "HHP Error Function (the funny log one!)",
                xlab ='actual DIH', ylab = "predicted DIH")
                ,key.title = title(main="Error"),
                )

Another Data Mining Blog

Saturday, 21 May 2011

Minimise the what?

No comments:

Post a Comment