Monday, 26 December 2011

$3 million - hmmm...

I've shown this plot a couple of times already, but here it is again, with another that shows the bar we have to limbo under to win $3 million.

Anyone think this is possible?

Any predictions of what the final winning error will be?

I predict 0.453

Place your bets here...


  1. I think the winning team just barely cracks 0.45, say 0.4480.

    The only way anybody gets close to 0.40 is if there's a major supplemental release of much more detailed data. I could see that happening shortly after Milestone #3, when it's beyond obvious that all the info content in the available data has been squeezed out, and the #6 through #20 teams have gone into hibernation. They may succumb to the need to bump up the (P.R.) ratings then.

  2. It's too early to predict now. The top teams are probably just overfitting the public board for lack of anything better to do. The approaches described in the last milestone papers are very weak from a theoretical point of view: they basically just apply some general ML methods to variables arbitrarily derived from the dataset, a kind of blind search.

  3. @redstr

    "It's too early to predict now."

    "The approaches that were described in the last milestone papers are very weak from a theoretical point of view."

    Who cares about theory? The proof of the pudding is in the eating!

  4. Too early because there's still over a year left. It's less than halfway through. Many people can still join in.

    The theory can get very practical if something important is discovered. Imagine using current ML techniques to predict, say, the movement of the planets in the sky. You can do it with some accuracy from raw 2D coordinates if a ton of observations is provided, but if you realize that they're actually projections from the 3D space of a heliocentric system, that will boost your models' accuracy to a previously unimaginable point.

    At the last milestone, both winners showed zero such insight into the problem; that's what I meant by theoretical weakness. For me, the content of the milestone papers was utterly disappointing. This has been noted by others on the forum as well. I think this contest, given the prize, deserves more fundamental research.

  5. (Haven't checked in here for a while, or I would have posted this earlier.)

    @RedStr: Your comments rub me the wrong way. Were you aware that:
    * this "Sali Mali" is one of the 3 members of the #1 team at Milestone 1,
    * he and David Vogel are both highly experienced professional data miners,
    * David Vogel worked for a number of years at one of the leading commercial companies doing data mining and predictive algorithms on HOSPITAL data in the USA, and
    * the 3rd member of the team is an MD with significant experience in the data aspects of measuring and managing population health?

    Obviously we don't know the backgrounds of many of the other people on the various competing teams. But this particular team will be hard to top in terms of relevant experience and background.

    It is a requirement of the Kaggle/Heritage Health Prize rules that winning teams describe their algorithms to a sufficient degree that other competent scientists can reproduce/corroborate the reported results. So the "technical algorithm" part of the Intellectual Property must be shared.

    But there is no requirement to share the thinking or explain the subject matter expertise that inspired the team to create the predictive features they used to get to the lead in Round One, nor to give any glimpses into what they have up their sleeves for Round 2 and beyond. That part of their I.P. they can protect, and not surprisingly, they have largely done so.

    IF you have expertise in this domain, and study their feature list and methods, you will find there is considerable insight and knowledge being brought to bear. For example, they have one feature that had not occurred to me, but immediately upon seeing this field I was reminded of published research on that particular aspect.

    Bear in mind too that this data is highly censored, masked and summarized. There are many known insights that cannot be applied because the data is so sterilized for public distribution.

    Apologies if this comes across as cranky.
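
As a rough illustration of the planetary analogy in comment 4, here is a minimal Python sketch. It assumes circular, coplanar orbits with rounded values for Mars (1.52 AU, 1.88 years), and every function name is hypothetical. The "structural" forecast reuses the same heliocentric model that generates the observations, so its error is zero by construction; the only point is to show how far a model-free extrapolation of the apparent angle drifts over a three-month horizon.

```python
import math

# Assumed, simplified orbits: circular and coplanar, in AU and years.
EARTH_RADIUS, EARTH_PERIOD = 1.0, 1.0
MARS_RADIUS, MARS_PERIOD = 1.52, 1.88


def heliocentric_position(radius, period, t):
    """Position on a circular orbit around the Sun at time t."""
    angle = 2.0 * math.pi * t / period
    return radius * math.cos(angle), radius * math.sin(angle)


def apparent_longitude(t):
    """Direction of Mars as seen from Earth at time t, in radians."""
    ex, ey = heliocentric_position(EARTH_RADIUS, EARTH_PERIOD, t)
    mx, my = heliocentric_position(MARS_RADIUS, MARS_PERIOD, t)
    return math.atan2(my - ey, mx - ex)


def angle_diff(a, b):
    """Signed difference a - b, wrapped into [-pi, pi]."""
    return math.atan2(math.sin(a - b), math.cos(a - b))


def blind_forecast(t, step, horizon):
    """Extrapolate the last observed angular rate; no orbital model."""
    prev, last = apparent_longitude(t - step), apparent_longitude(t)
    rate = angle_diff(last, prev) / step
    return last + rate * horizon


def structural_forecast(t, horizon):
    """Forecast from the heliocentric model (identical to the generator)."""
    return apparent_longitude(t + horizon)


def mean_abs_error(forecast, times, horizon):
    """Mean absolute angular error of a forecast over the given start times."""
    errors = [
        abs(angle_diff(forecast(t), apparent_longitude(t + horizon)))
        for t in times
    ]
    return sum(errors) / len(errors)


STEP = 0.02      # roughly one week between observations
HORIZON = 0.25   # forecast three months ahead
TIMES = [0.05 * i for i in range(1, 81)]  # start times over about four years

blind_err = mean_abs_error(
    lambda t: blind_forecast(t, STEP, HORIZON), TIMES, HORIZON
)
model_err = mean_abs_error(
    lambda t: structural_forecast(t, HORIZON), TIMES, HORIZON
)

print(f"blind extrapolation, mean error (rad): {blind_err:.4f}")
print(f"heliocentric model, mean error (rad):  {model_err:.4f}")
```

A fairer comparison would fit the orbital radii and periods from noisy observations rather than assume them, but even this toy version shows where the blind approach struggles: the apparent angle reverses direction during retrograde motion, and a rate-based extrapolation has no way to anticipate that.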