Wednesday 31 August 2011

First Milestone

The time for submitting an entry for the first milestone has now been and gone, and with it the chance to win either $30,000 or $20,000 for the top two places.

The images below show the state of play (on the 30% visible data!) just after the first milestone (I didn't capture the image in time, so some teams have made an extra submission since).
There was a bit of movement in the last 2 days.

  • Tom & Jerry changed their name to reveal themselves (Petterson & Caetano). This was interesting for me, as they are from a relatively new machine learning lab here in Australia that I had not been aware of. They jumped up to 3rd position before the last submission but were overhauled at the last minute.
  • Willem Mestrom made a big leap on his last submission to get into 3rd position. He was obviously keeping his best (or was he?) until the very last moment. This might be a good tactical play if he is looking towards the $3 million rather than the short-term smaller prizes.
  • Planet Melbourne, after making consistent daily submissions throughout the competition, did not take the opportunity to make one last submission.
  • It is good to see Chris R still in the top 10. Chris has been up there since the very start. Interestingly, Chris was cut off from submitting for a while due to Hurricane Irene taking his power away.
Interestingly, positions 2, 3 & 4 are now taken by members of The Ensemble of Netflix fame. Position 1 contains Dave, who was in an original team called Ensemble Team and also entered Netflix as a member of Ensemble Experts, while I (also a member of Market Makers) ran a competition on Ensembling the Netflix Data and have looked at ensembling other competitions.
There seems to be a pattern here!
 
    Note you can click on the images to enlarge them.

    [Leaderboard images: top 10 just after the first milestone]

    The plot below shows the Lowess line fit of the error as a function of the number of entries submitted.
    [Plot: Lowess fit of error vs. number of entries submitted]

    Saturday 27 August 2011

    First Milestone Just Around the Corner

    There are about 4 days to go until the deadline for the first milestone, so I thought it would be interesting to capture the leaderboard now and then again in 4 days, to see if things change wildly. At this stage it can be like a game of poker.

    All the top 10 teams are still in the hunt and making daily submissions. Tom & Jerry were late starters but have made astonishing progress to 3rd place. They currently split 3 others who were part of 'The Ensemble': Edward, Willem and B.Yang.

    So many of the teams at the top have a proven history in data mining competitions. I will be very happy if those ensemble dudes repeat their performance from the Netflix Prize ;-).
     

    Friday 19 August 2011

    1 minute of fame

    The HHP competition and Kaggle are catching the attention of the media over here in Australia.


    I was on TV recently (and yes, it does add a few kg to you),


    http://www.abc.net.au/catalyst/stories/3296837.htm


    and in the paper (the 2nd last paragraph is interesting!),


    http://www.smh.com.au/business/miners-of-a-different-kind-in-line-for-riches-of-their-own-20110731-1i6bi.html


    Blog posts may be limited over the next 2 weeks as it's nose down to try and catch Edward.



    Wednesday 3 August 2011

    Bath Mawr Coch


    In one of my first posts you will see how my eagerness to gloat about my success unleashed Dave into the HHP. He has subsequently been at the top of the leaderboard from the beginning, whereas my progress was more gradual. Eventually we seemed to converge on a similar score and even held positions 1 & 2 for a while, which was very surprising to me given the calibre of contestants this competition has attracted.

    We have now decided to join forces, hence the appearance of team Bath Mawr Coch. Hopefully there will be some synergy in our methods.

    Dave and I are both competition junkies. We first came into contact as a result of the KDDCup 2004 contest, which Dave won - you will see my effort in a lowly 11th place. The leaderboard was left open after the conclusion of the contest, and I eventually got to the top spot, which is obviously not that impressive when the winners had already revealed what they did - but I had learnt things and been able to implement them. I then contacted Dave to gloat - and he immediately overtook me.


    We have since worked on a couple of other competitions together, and won the NCDM Analytic challenge in 2008.


    Competitions like the HHP are a great way to learn and to also meet others with similar interests. I'm sure many collaborations and friendships will be made as a result, to the benefit of everyone.






    Tuesday 2 August 2011

    Gone Shopping

    Another New Comp


    Kaggle have just posted a new competition, Dunnhumby's Shopper Challenge, which is a very interesting one. The data consists of historical records of when customers visited a store and how much they spent. The goal is to predict when each customer will next visit the store and how much they will spend.


    This is probably a novel real-world data set and I expect the interest in this competition to be high. The reasons I think this are:

    1. The data is simple
    2. There is not a massive amount of data, so processing power will not be an issue
    3. It is a novel problem where creativity is needed and a new algorithm will probably have to be developed. It is not going to be a case of who can pre-process the data best and build the best ensemble using existing algorithms.



    So, where to start?


    The first thing I always do is run the data through my Nifty Tool (see here for an example) to check it and generate some SQL so I can load it into a database.


    Here is what comes out...



    CREATE DATABASE dunnhumby
    GO
    USE dunnhumby
    
    CREATE TABLE training
    (
        customer_id    int
    ,    visit_date    date
    ,    visit_spend    float
    )
    
    
    BULK INSERT training
    FROM 'E:\comps\dunnhumby\training.csv'
    WITH
    (
    MAXERRORS = 0,
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
    )
    
    --(12146637 row(s) affected)
    
    
    CREATE TABLE test
    (
        customer_id    int
    ,    visit_date    date
    ,    visit_spend    float
    )
    
    
    BULK INSERT test
    FROM 'E:\comps\dunnhumby\test.csv'
    WITH
    (
    MAXERRORS = 0,
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
    )
    



    The data set is relatively nice, just 3 columns:
    • customer ID
    • date
    • amount spent
    The dates run from 1st April 2010 to 18th July 2011.
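
    Before doing anything else, a couple of quick queries along these lines (just a sanity-check sketch against the tables created above, not part of the competition material) will confirm the row counts, the number of customers and that date range:


    -- confirm row counts, number of customers and the date range in each table
    select COUNT(*) as n_rows
    , COUNT(distinct customer_id) as n_customers
    , MIN(visit_date) as first_visit
    , MAX(visit_date) as last_visit
    from training
    
    select COUNT(*) as n_rows
    , COUNT(distinct customer_id) as n_customers
    , MIN(visit_date) as first_visit
    , MAX(visit_date) as last_visit
    from test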


    A Naive Guess


    I always like to get a submission in without really looking at the data, just to make sure I am reading the data correctly and that the submission file is in the correct format.


    In forecasting, a naive prediction is not going to be far wrong in the long run. If you want to forecast the weather, then saying tomorrow will be the same as today will serve you well if you have no other information to go on. So for this problem, we will say the next spend will be the same as the last spend, and the gap in days will be the same as the previous gap.


    Achieving this in SQL is a bit tricky, but it can be done. The code below does the trick and, when submitted, gets 9.5% correct on the leaderboard, the same as the simple baseline benchmark.



    /*************************************************
    SQL to generate a naive prediction
    - spend will be the same as the last visit
    - the next visit will be in the same number
      of days as the gap between the last two visits
    **************************************************/
     
    -- append a visit number to the data - visit 1 = most recent visit
    select *, Rank() over (Partition BY customer_id order by visit_date desc) as visit_number
    into #temp1
    from test
    
    -- create field which is days since previous visit
    select a.*
    ,b.visit_date as previous_visit_date
    ,DATEDIFF(DD,b.visit_date,a.visit_date) as days_since_previous_visit
    into #temp2
    from #temp1 a inner join #temp1 b
    on a.customer_id = b.customer_id
    and a.visit_number = b.visit_number - 1
    where a.visit_number = 1
    
    -- generate the submission file, making sure 1st April 2011 is the earliest possible date
    select
    customer_id
    ,
        (case 
            when dateadd(dd,days_since_previous_visit,visit_date) < '2011-04-01' 
            then '2011-04-01' 
            else dateadd(dd,days_since_previous_visit,visit_date)
        end) 
        as visit_date
    ,visit_spend
    from #temp2
    order by customer_id
    
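    As a quick check before submitting (these queries are just my own additions, run in the same session while #temp2 still exists), it is worth seeing how many customers actually ended up with a prediction - any customer with only a single visit in the test file will have dropped out of the inner join above - and how many predicted dates had to be clipped forward to 1st April 2011:


    -- how many customers ended up with a prediction?
    select COUNT(*) as n_predictions
    , COUNT(distinct customer_id) as n_customers
    from #temp2
    
    -- compare with the number of customers in the test file
    select COUNT(distinct customer_id) as n_test_customers
    from test
    
    -- how many predicted dates were clipped forward to 1st April 2011?
    select COUNT(*) as n_clipped
    from #temp2
    where dateadd(dd,days_since_previous_visit,visit_date) < '2011-04-01'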


    Now Let's Look at the Data

    In order to get a feel for things, we will aggregate the data to a daily level and look at how the population as a whole goes shopping. We will create a count of the total number of visits per day and generate some time-based fields that will help us understand the daily and seasonal patterns in shopping. The SQL below will generate that data, which is small enough to save in Excel or as a text file and then load into your favourite analysis package, or we could just create a table in the database and load directly from there (don't leave it as a view, as it takes a while to generate).



    -- add a field which is the days since 1st April (1st April 2010)
    alter table training add daysSinceStart int
    update training set daysSinceStart = DATEDIFF(dd,'2010-04-01',visit_date)
    
    -- create a daily summary dataset
    select visit_date
    
    -- fields of interest
    , COUNT(*) as visits
    , avg(visit_spend) as avg_visit_spend
    , stdev(visit_spend) as stdv_visit_spend
    
    -- time based predictor variables (doySin/doyCos encode the day of year as a smooth
    -- annual cycle; the day-of-week flags assume the default DATEFIRST = 7, so 1 = Sunday)
    , min(daysSinceStart) as daysSinceStart
    , sin(2 * pi() * (DATEPART(DAYOFYEAR, visit_date) * 1.0 / 365.0)) as doySin 
    , cos(2 * pi() * (DATEPART(DAYOFYEAR, visit_date) * 1.0 / 365.0)) as doyCos
    , (case when (DATEPART(WEEKDAY, visit_date)) = 1 then 1 else 0 end) as dowSun
    , (case when (DATEPART(WEEKDAY, visit_date)) = 2 then 1 else 0 end) as dowMon
    , (case when (DATEPART(WEEKDAY, visit_date)) = 3 then 1 else 0 end) as dowTue
    , (case when (DATEPART(WEEKDAY, visit_date)) = 4 then 1 else 0 end) as dowWed
    , (case when (DATEPART(WEEKDAY, visit_date)) = 5 then 1 else 0 end) as dowThu
    , (case when (DATEPART(WEEKDAY, visit_date)) = 6 then 1 else 0 end) as dowFri
    , (case when (DATEPART(WEEKDAY, visit_date)) = 7 then 1 else 0 end) as dowSat
    
    from dbo.training
    group by visit_date
    order by visit_date asc
    
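    If you do go down the route of creating a table rather than re-running the aggregation each time, the same query just needs an INTO clause after the select list. A cut-down sketch is below - the table name dbo.daily_summary is simply one I have made up for illustration, and the full field list from the query above can be dropped straight in:


    -- materialise a (cut-down) daily summary as a table
    select visit_date
    , COUNT(*) as visits
    , AVG(visit_spend) as avg_visit_spend
    , STDEV(visit_spend) as stdv_visit_spend
    into dbo.daily_summary
    from dbo.training
    group by visit_date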

    Tiberius is an ideal tool for analysing this type of data, as I initially wrote it to look at a similar type of data set - electricity consumption - and the patterns I want to discover will be similar.


    I have built a model to predict visits as a function of time. The model predictors are the weekly, seasonal and trend components, so we are trying to model the number of visits based solely on time. The plot below shows the errors in the model; the points are in date order.


    (Click on the plots to enlarge)

    [Plot: model errors, points in date order]

    What we see is that there are certain dates where the errors stand out. These are easily identified in Tiberius by rolling your mouse over the image, and the dates are displayed.


    The two big anomalies are 25th Dec 2010 and 1st Jan 2011 - Christmas Day and New Year's Day. So if we are trying to deduce where this data set is from, China can probably be crossed off the list.


    Other dates with large errors are:

    1. 4th April 2010
    2. 3rd May 2010
    3. 31st May 2010
    4. 30th August 2010

    A bit of research will show that these are all England & Wales public holidays, so I think we have figured out where the data set originates (hello Tesco!).
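
    The same dates can be surfaced without a model at all. As a rough cross-check (this query is my own addition and assumes nothing beyond the training table loaded earlier), compare each day's visit count with the average count for that day of the week and list the dates with the biggest gaps - the holiday dates above should float straight to the top:


    -- model-free anomaly check: compare each day's visits to the average for
    -- that day of the week and list the dates with the largest deviations
    ;with daily as
    (
        select visit_date, COUNT(*) as visits
        from dbo.training
        group by visit_date
    )
    select top 10 visit_date, visits, dow_avg, visits - dow_avg as deviation
    from
    (
        select visit_date, visits
        , AVG(visits * 1.0) over (partition by DATEPART(WEEKDAY, visit_date)) as dow_avg
        from daily
    ) x
    order by ABS(visits - dow_avg) desc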


    The model of visits can be decomposed into weekly, annual and trend components. 


    The weekly component shows Sunday is the quietest day, and Friday the busiest day. Monday, Tuesday and Wednesday are all similar.

    [Plot: weekly component]

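    As a quick cross-check of that weekly pattern straight from the raw data (another throwaway query of my own; DATENAME is used so nothing depends on the DATEFIRST setting), the average number of visits per calendar day for each day of the week can be computed as follows:


    -- average visits per calendar day, by day of week
    select DATENAME(WEEKDAY, visit_date) as day_of_week
    , COUNT(*) * 1.0 / COUNT(distinct visit_date) as avg_visits_per_day
    from dbo.training
    group by DATENAME(WEEKDAY, visit_date)
    order by avg_visits_per_day desc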
    The Seasonal component shows early April is the busiest time of year.

    [Plot: seasonal component]

    And unfortunately the trend looks to be going downhill.

    [Plot: trend component]

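    To eyeball that trend directly (one more throwaway query of my own; note the final month is partial, as the data only runs to 18th July 2011), the visits can simply be totalled by calendar month:


    -- total visits and average spend per calendar month (July 2011 is partial)
    select YEAR(visit_date) as yr
    , MONTH(visit_date) as mth
    , COUNT(*) as visits
    , AVG(visit_spend) as avg_spend
    from dbo.training
    group by YEAR(visit_date), MONTH(visit_date)
    order by yr, mth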
    Finally, here is an interesting plot of the standard deviation of the spend:

    [Plot: standard deviation of spend]

    So we are on our way to understanding this problem. 


    Good luck everyone, hope the challenge keeps you busy!