Another Data Mining Blog: 2019

Sunday, 20 October 2019

Milestone Prize Solution

2019

Congratulations to Samuel Messina for winning the $1,000 milestone prize. The write-up of his technique can be found here.

Sunday, 15 September 2019

Buy Low Sell High - leaderboard and rule clarification

You can now register your team and submit your trades files to see how you do on the leaderboard:

https://melbournedatathon.com

Just to clarify a few things:

You should only use the values of the current and previous predictions to make a decision
If you use future prediction values then you will probably get a good result but this logic is unimplementable
Any strategies found to be using future prediction values will be disqualified, so there is no point in trying
Do not use anything else in your algorithm such as pair name, price or absolute time
If you want to use previous prediction values, then use the relative time differences to determine them
The new file we provide will be consistent in having predictions generated at 5-minute intervals, but the absolute time values could be anything
After the deadline, you will be asked to nominate your 3 long and 3 short strategies you want to be evaluated
All teams beating the Benchmark solution on the private leader board for their nominated strategies will then qualify for stage 2
We will then invite those teams to run their code over several new files. They must only use the strategies nominated and the strategies must be locked in with no further parameter tuning allowed
The new files will have different pair names and the start time for the field minutesSinceStart will not necessarily be the same start time as the file already provided
The winner will be the team that gives the best return on the new data, provided they still beat the Benchmark and we are confident no future prediction values have been used.
There will be one winner for Short and one winner for Long
If we suspect future predictions are being used then we will say how we came to this conclusion and the team will have a right of reply to prove otherwise
There will be a benchmark prize for the first leading team on the private leaderboard as at 12 pm on Thu Oct 17th that wishes to reveal their method. In order to receive the prize the team must write a blog post describing their method so it can be reproduced by others. It must not use future information. It is not compulsory to reveal your method, so we will proceed down the ranking and award the prize to the first team that wishes to do so.

Just to clarify what we mean by 'future information'. The data set does contain records that are 'in the future' to the times we have asked you to make decisions for. It is OK to use this data to come up with a set of coefficients for a model.
What it is NOT OK to do is use the raw prediction values at time 'x' as inputs to a model making a decision for a time prior to 'x'. This is an unimplementable solution.
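
The "current and previous predictions only" rule above can be sketched in a few lines of base R. This is only an illustration on a made-up table: the column names keys_pair, minutesSinceStart and Lpred7b follow the data file, but the values, the helper name lagByMinutes and the entry rule itself are assumptions, not part of the competition code.

```r
# toy data: two pairs, predictions every 5 minutes, arbitrary start times
toy <- data.frame(
  keys_pair = rep(c("pair_A", "pair_B"), each = 4),
  minutesSinceStart = c(1000, 1005, 1010, 1015, 40, 45, 50, 55),
  Lpred7b = c(0.001, 0.003, 0.002, 0.004, -0.001, 0.000, 0.002, 0.001)
)

# hypothetical helper: fetch the prediction made 'lagMinutes' earlier for the
# same pair, located purely by the relative time difference
lagByMinutes <- function(d, col, lagMinutes) {
  key    <- paste(d$keys_pair, d$minutesSinceStart)
  wanted <- paste(d$keys_pair, d$minutesSinceStart - lagMinutes)
  d[[col]][match(wanted, key)]   # NA where no earlier record exists
}

toy$Lpred7b_prev <- lagByMinutes(toy, "Lpred7b", 5)

# an illustrative decision that uses only the current and previous values:
# enter long when the prediction is rising and at least 0.002
toy$longEntry <- as.integer(!is.na(toy$Lpred7b_prev) &
                            toy$Lpred7b > toy$Lpred7b_prev &
                            toy$Lpred7b >= 0.002)
print(toy)
```

Nothing here depends on the pair name, the price or the absolute value of minutesSinceStart, so the same logic should carry over to a file with a different start time.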

If there is anything else that needs clarifying, ask below and we will add to the list

Good Luck

Thursday, 5 September 2019

Buy Low Sell High - file checking

In the previous post there was code to generate a submission file.

Submission files need to be valid so we can process them, otherwise they will be rejected by the calculation engine. Below is some code which your file should pass to ensure it will not be rejected.

#----------------------------------------------------------
#
# some R code to test your submission file is valid
#
#----------------------------------------------------------

library(data.table)

#define where the data is
fileToTest <- 'test_submission_file_1.csv'


dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,fileToTest)

#read in the data
tradeData <- fread(theDataFile)

#set error message
ermsg <- 'OK'
  
  #these are the column names required
  requiredCols <- c('strategyName','keys_pair','enterTime','exitTime','direction','percentPair')
  
  #------------------------
  #do some basic checks
  #-----------------------
  if (!identical(sort(intersect(colnames(tradeData),requiredCols)) , sort(requiredCols))) {
    missing_cols <- setdiff(requiredCols,colnames(tradeData))
    ermsg <- paste('\n incorrect column names in the trades file. We are missing:\n',paste(missing_cols,collapse = ","))
  } else if (class(tradeData$enterTime) == "character") {
    ermsg <- '\n the field enterTime contains non numeric values'
  } else if (class(tradeData$exitTime) == "character") {
    ermsg <- '\n the field exitTime contains non numeric values'
  } else if (class(tradeData$direction) == "character") {
    ermsg <- '\n the field direction contains non numeric values'
  } else if (class(tradeData$percentPair) == "character") {
    ermsg <- '\n the field percentPair contains non numeric values'
  } else if (nrow(tradeData) == 0){
    ermsg <- '\n the file contains no records'
  } else if(length(which(is.na(tradeData)==TRUE)) > 0) {
    ermsg <- '\n the file contains missing values' 
  } else if (min(tradeData$exitTime - tradeData$enterTime) <= 0) {
    ermsg <- '\n the exit time needs to be after the entry time'
  }
  

  if (ermsg != 'OK'){
    stop(ermsg)
  }


  #-------------------------------------
  # check that we are not over 100% in 
  # a trade at any particular time
  #-------------------------------------
  timeStep <- 5
  
  tradeData[,xKey := paste(keys_pair,strategyName,direction,sep="_")]
  allKeys <- unique(tradeData$xKey)
  
  for (thisKey in allKeys){
    
    pairStrategyTradeData <- subset(tradeData,xKey == thisKey)
    E_numberOfTrades <- nrow(pairStrategyTradeData)
    
    cat("\n",thisKey,E_numberOfTrades)
    
    #get the times we are in a trade
    for (tradeNumber in 1:E_numberOfTrades){
      thisTimes1 <- seq(from = pairStrategyTradeData[tradeNumber,enterTime],to=pairStrategyTradeData[tradeNumber,exitTime- timeStep],by=timeStep )
      thisPercent1 <- rep(pairStrategyTradeData[tradeNumber,percentPair],length(thisTimes1))
      if (tradeNumber==1){
        allTimes1 <- thisTimes1
        allPercent1 <- thisPercent1
      } else {
        allTimes1 <- c(allTimes1,thisTimes1)
        allPercent1 <- c(allPercent1,thisPercent1)
      }
    } #tradeNumber
    
    #make sure the max percent at any one time is not > 1
    d <- cbind.data.frame(allTimes1,allPercent1)
    #total percent committed at each time stamp
    posPercents <- aggregate(allPercent1 ~ allTimes1,data=d,sum)
    
    maxPercent <- max(posPercents$allPercent1)
    if (maxPercent > 1){
      #look up the offending time in the aggregated table (not in d, which has one row per trade per time)
      aBadTime <- posPercents[which(posPercents$allPercent1 > 1)[1],'allTimes1']
      ermsg <- paste("\n you cannot be more than 100% in a trade at any one time:\n",thisKey,"\ntime=",aBadTime,"\npercent =",maxPercent)
      stop(ermsg)
    }
    
  } #  thisKey

  cat('\nthe file appears to be OK!')

Update 17th Oct 2019.
The file validator is now stricter. The basic checks section in the above code has been expanded to the following set of rules:




requiredCols <- c('strategyName','keys_pair','enterTime','exitTime','direction','percentPair')
requiredPairs <- paste0('pair_',1:22)

#do some checks
if (!identical(sort(intersect(colnames(tradeData),requiredCols)) , sort(requiredCols))) {
  missing_cols <- setdiff(requiredCols,colnames(tradeData))
  ermsg <- paste('\n incorrect column names in the trades file. We are missing:\n',paste(missing_cols,collapse = ","))
} else if (class(tradeData$enterTime) == "character") {
  ermsg <- '\n the field enterTime contains non numeric values'
} else if (class(tradeData$exitTime) == "character") {
  ermsg <- '\n the field exitTime contains non numeric values'
} else if (class(tradeData$direction) == "character") {
  ermsg <- '\n the field direction contains non numeric values'
} else if (class(tradeData$percentPair) == "character") {
  ermsg <- '\n the field percentPair contains non numeric values'
} else if (nrow(tradeData) == 0){
  ermsg <- '\n the file contains no records'
} else if(length(which(is.na(tradeData)==TRUE)) > 0) {
  ermsg <- '\n the file contains missing values'
} else if (min(tradeData$exitTime - tradeData$enterTime) <= 0) {
  ermsg <- '\n the exit time needs to be after the entry time'
} else if (nrow(tradeData) < 50){
  ermsg <- '\n not enough trades'
} else if (max(tradeData$direction) > 1){
  ermsg <- '\n trade direction > 1'
} else if (min(tradeData$direction) < 0){
  ermsg <- '\n trade direction < 0'
} else if (length(unique(tradeData$direction)) > 2){
  ermsg <- '\n too many trade directions'
} else if (max(tradeData$enterTime %% 5) > 0) {
  ermsg <- '\n wrong times'
} else if (max(tradeData$exitTime %% 5) > 0) {
  ermsg <- '\n wrong times'
} else if (min(tradeData$percentPair) < 0) {
  ermsg <- '\n negative percent pairs'
} else if ( length(intersect(unique(tradeData$keys_pair),requiredPairs)) == 0 ){
  ermsg <- '\n you need to predict for all pairs'
} else if ( !identical(sort(intersect(unique(tradeData$keys_pair),requiredPairs)),sort(requiredPairs))){
  ermsg <- '\n you need to predict for all pairs'
} else if (min(tradeData[,.N,by = c('keys_pair','direction')]$N) < 5) {
  ermsg <- '\n you need at least 5 trades per pair per direction'
}

Tuesday, 27 August 2019

Buy Low Sell High - a simple strategy

Below is some R code to generate a valid submission file for a very simple strategy.

It is up to you to find the strategies that will give the highest returns.

In the next post we'll introduce the leaderboard where you can upload your strategies.


############################################################
#
# sample R code for the 2019 Melbourne Datathon that will 
# generate a valid submission file.
#
############################################################



library(data.table)


#----------------------------------------------
#define where the data is & load
#----------------------------------------------
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")
dt <- fread(theDataFile)


#-------------------------
# strategy settings
#-------------------------
feePerc <- 0.0015 #fixed const for the competition
longThresh <- 0.002 #the value the prediction has to be higher than
longRemainWindow <- 12 #the number of hours to remain in a trade


#---------------------------------------------------
# flag if the criteria for entry is passed
#---------------------------------------------------
dt[,longEntry := 0]
dt[Lpred7b >= longThresh, longEntry := 1]

#----------------------
# set the exit times
#----------------------
exitLongs <- subset(dt,longEntry == 1,select = c('keys_pair','minutesSinceStart'))
exitLongs[,minutesSinceStart := minutesSinceStart + (longRemainWindow * 60)]
exitLongs[,longExit := 1]

#-------------------------------
#merge the entry and exit times
#-------------------------------
myKeys <-  c("keys_pair","minutesSinceStart")
setkeyv(dt,myKeys)
setkeyv(exitLongs,myKeys)
dt <- merge(dt,exitLongs,all.x=TRUE)

#-----------------------------
# no match = no exit 
#-----------------------------
dt[is.na(longExit), longExit := 0]

#--------------------------------------------------
# if there is an exit and enter, change the exit
#--------------------------------------------------
dt[longEntry == 1 & longExit == 1, longExit := 0]

#-----------------------------------------------------
#assign a group number to consecutive runs of entries
#-----------------------------------------------------
setkeyv(dt,myKeys)
dt[,grp := rleidv(dt, cols="longEntry")] 
dt[,grp1 := paste0(keys_pair,grp)]

dt[, inc := cumsum(longEntry),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#--------------------------------------
# the first entry in the run is a BUY
#--------------------------------------
dt[,buy := 0]
dt[longEntry==1 & inc1 == 1,buy :=1 , by=grp1]

#-----------------------------------------------------
#assign a group number to consecutive runs of exits
#-----------------------------------------------------
dt[,grp := rleidv(dt, cols="longExit")] 
dt[,grp1 := paste0(keys_pair,grp)]

dt[, inc := cumsum(longExit),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#-------------------------------------
# The first exit in the run is a SELL
#-------------------------------------
dt[,sell := 0]
dt[longExit==1 & inc1 == 1,sell :=1 , by=grp1]


#---------------------------------------------------
# Keep just the rows where we are BUYING or SELLING
#---------------------------------------------------
d1 <- subset(dt,(sell==1 & buy==0 | sell==0 & buy==1),select = c('keys_pair','minutesSinceStart','tradePrice','buy','sell'))

#------------------------------------------
# Make sure each buy is followed by a sell
#------------------------------------------
d1[,prevRowBuy :=  shift(buy, 1, type="lag") , by=keys_pair]
d1[,prevRowSell :=  shift(sell, 1, type="lag") , by=keys_pair]
d1 <- subset(d1,prevRowBuy != buy & prevRowSell != sell)

#--------------------------------------------------------------
#find the next price (for buy rows it will be the sell price)
#--------------------------------------------------------------
d1[,sellPrice := shift(tradePrice, 1, type="lead"), by=keys_pair]
d1[,minutesSinceStartExit := shift(minutesSinceStart, 1, type="lead"), by=keys_pair]
d1[, tradeLength := minutesSinceStartExit - minutesSinceStart]


#----------------------------------------------------------------
# now we only need the buy rows as we have the sell time & price
#----------------------------------------------------------------
d1 <- subset(d1,buy==1 & !is.na(minutesSinceStartExit))


#-------------------------------------------------------------------
# long profit estimate (we only have prices for some pairs though)
# this does not include a stoploss calculation
#--------------------------------------------------------------------
d1[,tradeProfit := ( (sellPrice * (1 - feePerc)) - (tradePrice * (1 + feePerc)) )/ tradePrice]

hist(d1$tradeProfit,breaks=100,col='blue');abline(v=0,col='red')


#------------------------------
# generate a submission file 
#------------------------------
stratName <- 'demo_LongOnly'
d1[,enterTime := minutesSinceStart]
d1[,exitTime := minutesSinceStartExit]
d1[,direction := 1]
d1[,percentPair := 1]
d1[,strategyName := stratName]

d1 <- subset(d1,select = c('keys_pair','enterTime','exitTime','direction','percentPair','strategyName'))

fwrite(d1,paste0(dataFolder,stratName,'.csv'))

Buy Low, Sell High - DATA and Code

This post contains information about the 2019 Melbourne Datathon. Please read the previous post first.

The data is a single file and can be downloaded from the following links:

https://drive.google.com/open?id=11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553mb zipped)

https://drive.google.com/open?id=1WuFFk7NkBgYJxgDx8jeohwiLwwOMzLZP (1.5gb unzipped)

The column names should be self-explanatory. There are 2 price columns:

barClosePrice - this is the price at the current time stamp 'minutesSinceStart'. It is the price whose change the predictions (Lpred1b-Lpred14b) are predicting.

tradePrice - this is the price 5 minutes later. It is the price that will be used in the returns calculations. The reason for this is that we are assuming it takes 5 minutes from the time we decide to trade to actually executing the trade.

Below is some R code to get you going, and we'll be posting more in the coming days.

#--------------------------------------------------------------------------------------
# some R code to get you started with the 2019 Melbourne Datathon analytic challenge
# download the data from the link below and unzip to 'dataFolder'
# https://drive.google.com/file/d/11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 mb)
#--------------------------------------------------------------------------------------

library(data.table)

#define where the data is
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")

#read in the data
dt <- fread(theDataFile)

nrow(dt)
#4,907,361

colnames(dt)
# [1] "keys_pair"         "minutesSinceStart" "gap"               "barClosePrice"     "tradePrice"        "Lpred1b"           "Lpred2b"           "Lpred3b"           "Lpred4b"          
#[10] "Lpred5b"           "Lpred6b"           "Lpred7b"           "Lpred8b"           "Lpred9b"           "Lpred10b"          "Lpred11b"          "Lpred12b"          "Lpred13b"         
#[19] "Lpred14b" 


unique(dt$keys_pair)
#[1] "0x_bitcoin"           "bitcoin_usdollar"     "bitcoincash_usdollar" "cardano_bitcoin"      "dash_usdollar"        "litecoin_bitcoin"     "litecoin_tetherusd"   "monero_bitcoin"      
#[9] "qtum_bitcoin"         "ripple_bitcoin"       "ripple_usdollar"      "stratis_bitcoin"      "tron_tetherusd"       "zcash_bitcoin"        "pair_1"               "pair_2"              
#[17] "pair_3"               "pair_4"               "pair_5"               "pair_6"               "pair_7"               "pair_8"               "pair_9"               "pair_10"             
#[25] "pair_11"              "pair_12"              "pair_13"              "pair_14"              "pair_15"              "pair_16"              "pair_17"              "pair_18"             
#[33] "pair_19"              "pair_20"              "pair_21"              "pair_22"       


#price information is missing for the unnamed pairs
x1 <- subset(dt,keys_pair== "pair_1")
nrow(x1)
#132,721
summary(x1)


#distribution of the predictions
hist(dt$Lpred1b,breaks=100)

#time series of prices and predictions
x <- subset(dt,keys_pair== "bitcoin_usdollar")[1:1000]
plot(x$minutesSinceStart,x$barClosePrice,type='l',col="blue")
plot(x$minutesSinceStart,x$Lpred1b,type='l',col="blue")
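
The relationship between the two price columns described above can be sanity-checked along the following lines. This is a sketch on a made-up series, assuming a gap-free 5-minute grid; on the real file you would apply the same lookup per keys_pair, and any rows with missing bars may not match.

```r
# toy series for one pair on a 5-minute grid (made-up prices)
x <- data.frame(
  minutesSinceStart = seq(0, 20, by = 5),
  barClosePrice = c(100, 101, 99, 102, 103)
)

# build tradePrice the way the post describes it: the price 5 minutes later
x$tradePrice <- c(x$barClosePrice[-1], NA)

# check: look up barClosePrice at minutesSinceStart + 5 and compare
nextClose <- x$barClosePrice[match(x$minutesSinceStart + 5, x$minutesSinceStart)]
pricesAgree <- all(x$tradePrice == nextClose, na.rm = TRUE)
print(pricesAgree)
```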

'Buy Low Sell High' 2019 Melbourne Datathon - INTRODUCTION

"Buy Low - Sell High"

The analytics challenge for the 2019 Melbourne Datathon

Scenario

You are a trader for a pension fund that makes money for its investors by constantly buying and selling financial instruments such as shares in companies, foreign currencies or futures contracts. The general idea is that you aim to buy when the price is low and sell when it is high, hence making a profit.

Traditionally you'd stare at the live prices on a screen and through your experience know exactly when to trade. Recently your company has employed a team of machine learning experts to predict if prices are going up or down so as to hopefully remove the guesswork from your trading decisions.

Every 5 minutes you are now receiving 14 different predictions from mathematical models. What they forgot to tell you is how to actually use those predictions.

The challenge is to figure out how to use these predictions to determine trading decisions. The winner of the challenge will be the 'trader' whose 'strategy' makes the most money.

The Data

You have been provided with over 12 months of actual prices and predictions of future price movements at 5 minute intervals for some crypto-currency pairs. This is the data you will use to develop your 'trading strategy'.

Additionally you have also been provided with just the predictions (no prices) for some further currency pairs over the same time period. This is the data on which you will apply your strategy to come up with some 'buy' and 'sell' times that determine a 'trade'.

These times will be submitted and the returns calculated.

What does a 'strategy' look like?

A trading strategy uses only the predictions to make 'entry' and 'exit' trading decisions. It can be very simple logic or sophisticated logic - it is the challenge to come up with that logic.

e.g.
If prediction4 > 0.4 & prediction7 > 0.2 then 'Enter Long' (buy)
If prediction4 < 0.3 & prediction7 < 0.2 then 'Exit Long' (sell)

Code for an example strategy will be supplied
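
As a rough illustration of how the two rules in the example could be turned into trades (this is not the official example code), here is a small base R sketch. The prediction values are made up, and the variable names prediction4, prediction7, enterRule and exitRule are assumptions chosen to mirror the example.

```r
# made-up predictions for one pair at consecutive 5-minute steps
minutesSinceStart <- seq(0, 35, by = 5)
prediction4 <- c(0.1, 0.5, 0.5, 0.35, 0.2, 0.5, 0.1, 0.1)
prediction7 <- c(0.1, 0.3, 0.3, 0.25, 0.1, 0.3, 0.1, 0.1)

enterRule <- prediction4 > 0.4 & prediction7 > 0.2   # 'Enter Long'
exitRule  <- prediction4 < 0.3 & prediction7 < 0.2   # 'Exit Long'

# walk forward in time, remembering whether we are currently in a trade
inTrade   <- FALSE
enterTime <- numeric(0)
exitTime  <- numeric(0)
for (i in seq_along(minutesSinceStart)) {
  if (!inTrade && enterRule[i]) {
    enterTime <- c(enterTime, minutesSinceStart[i])
    inTrade <- TRUE
  } else if (inTrade && exitRule[i]) {
    exitTime <- c(exitTime, minutesSinceStart[i])
    inTrade <- FALSE
  }
}

# drop a trailing entry that never found an exit
enterTime <- enterTime[seq_along(exitTime)]
trades <- data.frame(enterTime, exitTime)
print(trades)
```

Each decision only looks at the predictions available at that time step, which keeps the logic implementable.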

What does a Submission look like?

You will be submitting a file with 6 columns. Each row of the file will be a trade that will have its return calculated.

keys_pair: the pair to trade
enterTime: the time to enter the trade
exitTime: the time to exit the trade
direction: are we going long (1) or short (0)
percentPair : what percent of the 'pair balance' do we want to put on this trade
strategyName: what are you calling this strategy (for reference)

The times we supply are just integers which represent 'minutes from start', hence no need to mess around with time stamps.

Direction means are we predicting the price will go up ('going long') or down ('short selling').

percentPair is the percent of the pair 'balance' that we want to put on each trade. There can be multiple trades for a single pair that overlap in time, but the sum of percentPair for that time stamp cannot be more than 1 (for each direction - see below*).

Long and Short?

It's still possible to make money if we think the price is going down, which is called 'short selling'. This is how it's done:

Say X are 10 shares in IBM and we think the share price is dropping.

Rather than buy X, we borrow X from a stockbroker and immediately sell it on the stock exchange. When the price drops we buy X back from the exchange and return X to the stockbroker. We have made a profit as we sold X for more than we bought it back. We would pay the stockbroker 'interest' out of our profit.

'Going long' is simpler as we expect the price to go up, so the stockbroker does not need to be involved. We simply 'buy low and sell high' directly from the stock exchange.

*For this exercise, Long and Short trade directions are evaluated separately, which means you will get a total return for long trades and a total return for short trades. Each submitted trades file can contain just longs, just shorts, or both longs and shorts.

How is the Strategy Evaluated?

In general, the return of a strategy is just the sum of the individual percentage gains of its trades. There is no re-investment of any profit.

For each trade we determine the enter (EN) and exit (EX) prices
The enter and exit prices are adjusted for the trading fee percentage (F), and then the trade percentage gain (TPG) is calculated as a percentage of the entry price

long: ( (EX*(1-F)) - (EN*(1+F)) )/ EN

short: ( (EN*(1-F)) - (EX*(1+F)) )/ EN
The percent weighting is applied
TPG = TPG * percentPair
TPG is capped
It is determined if a stoploss has occurred
TPG is then summed over all trades to give a total return
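
The long and short formulas above can be written as a small R function. This is a sketch for checking your own numbers, not the calculation engine: the function name tradeGain and the example prices are assumptions, and capping and stop losses are left out.

```r
feePerc <- 0.0015   # F, the competition fee of 0.15%

# trade percentage gain (TPG) for one or more trades,
# before capping and before any stop loss is applied
tradeGain <- function(enterPrice, exitPrice, direction, percentPair, fee = feePerc) {
  longGain  <- ((exitPrice  * (1 - fee)) - (enterPrice * (1 + fee))) / enterPrice
  shortGain <- ((enterPrice * (1 - fee)) - (exitPrice  * (1 + fee))) / enterPrice
  # direction: 1 = long, 0 = short
  ifelse(direction == 1, longGain, shortGain) * percentPair
}

# made-up prices: enter at 100, exit at 102
longTPG  <- tradeGain(100, 102, direction = 1, percentPair = 1)
shortTPG <- tradeGain(100, 102, direction = 0, percentPair = 0.5)
print(c(long = longTPG, short = shortTPG))
```

With these made-up prices the long trade gains about 1.7% after fees, while the half-weighted short loses about 1.15%, since it was on the wrong side of a 2% rise and pays the fee on both legs.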

Fees?

There is a cost associated with each trade made, called the fee. This means that we don't quite get the actual price as the exchange takes their cut. When we want to buy, we pay slightly more, and when we want to sell we get slightly less. The fee is applied to the price on entering and exiting a trade.

Hence when we make a trade, we need to be confident the returns will cover the fees we are charged. For this exercise, the fee F = 0.15%.

Capping?

We need to have confidence that the return of each strategy is not influenced by one or two 'lucky' large trades. To do this we take all raw TPGs and calculate the mean (M) and standard deviation (SD) of these individual gains. Any trade that has a return greater than M + (2*SD) or less than M - (2*SD) is capped at that value. This has the effect of removing extreme values.
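
The capping rule can be sketched as below. The gains are made up, and using R's sd() (the sample standard deviation) is an assumption, as the description does not say which form of the standard deviation is used.

```r
# made-up raw trade percentage gains, with one 'lucky' trade at the end
rawTPG <- c(0.01, -0.02, 0.015, 0.005, -0.01, 0.02, -0.005, 0.01, 0.0, 0.30)

M  <- mean(rawTPG)
SD <- sd(rawTPG)        # sample standard deviation (an assumption)
upper <- M + 2 * SD
lower <- M - 2 * SD

# clamp every gain into the interval [lower, upper]
cappedTPG <- pmin(pmax(rawTPG, lower), upper)

print(round(cbind(rawTPG, cappedTPG), 4))
```

Only the last trade is pulled back to the upper limit; the other nine are already inside the band and are left unchanged.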

Stop Losses?

If the price goes too far in the wrong direction then a 'stop loss' will be executed for that trade. For example, if we are 'going long' (hoping the price will go up) but the price falls by more than X% before the trade exit time is reached, then the trade will be immediately exited for you. This trade will use the price at the time the stop loss was triggered to calculate the trade returns.

For this exercise the stop loss value is 5%
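
A minimal sketch of a stop loss for a long trade follows. It rests on several assumptions that the description above does not spell out: that the fall is measured against the entry price, that prices are checked at each 5-minute step, and that the exit uses the first price that breaches the limit. The helper name longExitPrice is hypothetical.

```r
stopLossPerc <- 0.05   # 5%, from the rules above

# hypothetical helper: 'prices' holds the prices seen while a long trade is
# open (first element = entry price, last element = planned exit price);
# returns the price that would be used for the exit
longExitPrice <- function(prices, stopLoss = stopLossPerc) {
  entry <- prices[1]
  hit <- which(prices < entry * (1 - stopLoss))
  if (length(hit) > 0) prices[hit[1]] else prices[length(prices)]
}

stopped    <- longExitPrice(c(100, 98, 94, 97, 105))  # falls below 95, so exits early
notStopped <- longExitPrice(c(100, 98, 96, 97, 105))  # never falls below 95
print(c(stopped = stopped, notStopped = notStopped))
```

In the first case the later recovery to 105 is never seen, because the trade has already been closed at 94.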

Are there any other trade restrictions?

trade length

A 'dumb' strategy would just be to do a single trade for each currency pair that lasts the whole of the time period and guess if you think it should be a long (price going up) or a short (price going down). Such a strategy might end up making a profit but it would be just guessing and not using the predictions. To prevent this, any trades that are longer than 4 days (60 x 24 x 4 minutes) in length will be eliminated.

number of trades

It's also possible if you think the predictions are no good to just submit zero trades and have a return of 0, hoping everyone else's strategy loses money. Hence all currency pairs must have at least 5 trades each.
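
Both restrictions are easy to check before submitting. The sketch below uses a made-up trades table; whether the minimum count is applied before or after the long trades are eliminated is not stated, so counting after elimination is an assumption (the more conservative one).

```r
maxTradeMinutes  <- 60 * 24 * 4   # 4 days = 5760 minutes
minTradesPerPair <- 5

# made-up trades: pair_1 has 5 short trades, pair_2 has one trade that is too long
trades <- data.frame(
  keys_pair = c(rep("pair_1", 5), rep("pair_2", 5)),
  enterTime = c(0, 100, 200, 300, 400, 0, 100, 200, 300, 400),
  exitTime  = c(50, 150, 250, 350, 450, 6000, 150, 250, 350, 450)
)

# eliminate trades longer than 4 days
tradeLength <- trades$exitTime - trades$enterTime
kept <- trades[tradeLength <= maxTradeMinutes, ]

# count the surviving trades per pair (pairs with none left still appear, as 0)
tradesPerPair <- table(factor(kept$keys_pair, levels = unique(trades$keys_pair)))
tooFew <- names(tradesPerPair)[tradesPerPair < minTradesPerPair]
print(tooFew)
```

Here pair_2 is flagged because one of its five trades runs for 6000 minutes and is dropped, leaving only four.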

How do we Submit our Trades?

Once you have created a trades file, you will upload it via our web portal. Each team can only upload 1 file every 6 hours, so be prudent. Each file can only contain 1 long strategy and 1 short strategy. Make sure your file names for each submission are different and that each strategy has a different name.

Several times a day we will calculate the returns and display them along with other metrics on the leader board.

You must submit trades for all the pairs in the data set. These will be divided into 3 groups and returns calculated for each group.

Train group - these are the currency pairs that have real names and prices. Use this set to make sure your calculations of the returns match ours.
Public group - this will be a subset of the remaining currency pairs (which have masked names and no prices). You will be able to see the returns for these pairs.
Private group - the remaining currency pairs. You won't be able to see the returns until the competition is over.

Who Wins?

In order to win there are several criteria:

The top 3 teams whose combined best long and best short returns on the private group exceed those of the benchmark solution will qualify for the next stage.

We will then release a further data set. We will need to witness your code crunching this new data to generate the predictions file.

Note: we don't need the code itself or the logic - that is your IP. We just need to see stuff happening to make sure there has been no dodgy business occurring!

It is the returns on this set that will determine the grand champion, so long as they still beat the benchmark solution

What do we Win?

There will be a cash prize for the top 3 teams beating the benchmark solution. Any team or individual may be asked to come and develop their strategy for a real trading company, so there is a good chance that this could lead to a job and wealth directly correlated to the performance of your solution!

Skills, Team size etc...

In order to do well in this challenge, you must have a couple of skills:

the ability to think logically to come up with ideas
the ability to write code to test those ideas

It will not be advantageous to form teams from the start. The benefit of team forming will come later on when you want to combine ideas.

Can you tell us more about the predictions?

There are 14 predictions for each currency pair for each time stamp. The predictions are weakly correlated to the actual percentage change in price at some point in the future. Prediction1 is the shortest-term prediction (1 hour ahead) and prediction14 is the longest-term prediction (96 hours ahead), with the remaining being sequential in their prediction horizon.

The predictions are simulated but do represent what can be achieved. We cannot divulge any further information about the predictions.

Where is the Data?

See our next post for information on the data and some sample code.