Another Data Mining Blog

Sunday, 20 October 2019

Milestone Prize Solution


Congratulations to Samuel Messina for winning the $1,000 milestone prize. The write-up of his technique can be found here.

Sunday, 15 September 2019

Buy Low Sell High - leaderboard and rule clarification


You can now register your team and submit your trades files to see how you do on the leaderboard:

https://melbournedatathon.com

Just to clarify a few things:

You should only use the values of the current and previous predictions to make a decision
If you use future prediction values then you will probably get a good result but this logic is unimplementable
Any strategies found to be using future prediction values will be disqualified, so there is no point in trying
Do not use anything else in your algorithm such as pair name, price or absolute time
If you want to use previous prediction values, then use the relative time differences to determine them
The new file we provide will consistently have predictions generated at 5-minute intervals, but the absolute time values could be anything
After the deadline, you will be asked to nominate the 3 long and 3 short strategies you want to be evaluated
All teams beating the Benchmark solution on the private leaderboard for their nominated strategies will then qualify for stage 2
We will then invite those teams to run their code over several new files. They must only use the strategies nominated and the strategies must be locked in with no further parameter tuning allowed
The new files will have different pair names and the start time for the field minutesSinceStart will not necessarily be the same start time as the file already provided
The winner will be the team that gives the best return on the new data, provided they still beat the Benchmark and we are confident no future prediction values have been used.
There will be one winner for Short and one winner for Long
If we suspect future predictions are being used then we will say how we came to this conclusion and the team will have a right of reply to prove otherwise
There will be a benchmark prize for the first leading team on the private leaderboard as at 12 pm on Thu Oct 17th that wishes to reveal their method. In order to receive the prize the team must write a blog post describing their method so it can be reproduced by others. It must not use future information. It is not compulsory to reveal your method, so we will proceed down the ranking and award the prize to the first team that wishes to do so.

To clarify what we mean by 'future information': the data set does contain records that are 'in the future' relative to the times we have asked you to make decisions for. It is OK to use this data to come up with a set of coefficients for a model.
What it is NOT OK to do is use the raw prediction values at time 'x' as inputs to a model making a decision for a time prior to 'x'. This is an unimplementable solution.
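
As a rough illustration of the rule above, the sketch below builds a decision from the current prediction and the one before it, located by relative time offset only. The toy data, the column name prediction and the entry rule are assumptions made up for the example; it is not the Benchmark strategy.

```r
library(data.table)

# Hypothetical toy data for a single pair: one prediction every 5 minutes.
# 'prediction' is an assumed column name; minutesSinceStart is the field named in the rules.
preds <- data.table(
  minutesSinceStart = seq(1000, 1045, by = 5),
  prediction = c(0.1, 0.2, 0.4, 0.3, -0.1, -0.3, -0.2, 0.0, 0.2, 0.5)
)
setorder(preds, minutesSinceStart)

# Previous prediction, found by stepping one row back (never forward).
preds[, prevPrediction := shift(prediction, n = 1, type = "lag")]

# Only trust the lag when the previous row really is 5 minutes earlier,
# so the logic depends on time differences and not on absolute time.
preds[, gap := minutesSinceStart - shift(minutesSinceStart, n = 1, type = "lag")]
preds[is.na(gap) | gap != 5, prevPrediction := NA_real_]

# Illustrative rule only: enter long when the prediction is positive and rising.
preds[, enterLong := !is.na(prevPrediction) & prediction > 0 & prediction > prevPrediction]

print(preds)
```

Each row's decision reads only that row and the row before it, so the same code gives the same decisions if the start time of minutesSinceStart is shifted.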

If there is anything else that needs clarifying, ask below and we will add to the list

Good Luck

Thursday, 5 September 2019

Buy Low Sell High - file checking


In the previous post there was code to generate a submission file.

Submission files need to be valid so we can process them, otherwise they will be rejected by the calculation engine. Below is some code which your file should pass to ensure it will not be rejected.

#----------------------------------------------------------
#
# some R code to test your submission file is valid
#
#----------------------------------------------------------

library(data.table)

#define where the data is
fileToTest <- 'test_submission_file_1.csv'


dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,fileToTest)

#read in the data
tradeData <- fread(theDataFile)

#set error message
ermsg <- 'OK'
  
  #these are the column names required
  requiredCols <- c('strategyName','keys_pair','enterTime','exitTime','direction','percentPair')
  
  #------------------------
  #do some basic checks
  #-----------------------
  if (!identical(sort(intersect(colnames(tradeData),requiredCols)) , sort(requiredCols))) {
    missing_cols <- setdiff(requiredCols,colnames(tradeData))
    ermsg <- paste('\n incorrect column names in the trades file. We are missing:\n',paste(missing_cols,collapse = ","))
  } else if (class(tradeData$enterTime) == "character") {
    ermsg <- '\n the field enterTime contains non numeric values'
  } else if (class(tradeData$exitTime) == "character") {
    ermsg <- '\n the field exitTime contains non numeric values'
  } else if (class(tradeData$direction) == "character") {
    ermsg <- '\n the field direction contains non numeric values'
  } else if (class(tradeData$percentPair) == "character") {
    ermsg <- '\n the field percentPair contains non numeric values'
  } else if (nrow(tradeData) == 0){
    ermsg <- '\n the file contains no records'
  } else if(length(which(is.na(tradeData)==TRUE)) > 0) {
    ermsg <- '\n the file contains missing values' 
  } else if (min(tradeData$exitTime - tradeData$enterTime) <= 0) {
    ermsg <- '\n the exit time needs to be after the entry time'
  }
  

  if (ermsg != 'OK'){
    stop(ermsg)
  }


  #-------------------------------------
  # check that we are not over 100% in 
  # a trade at any particular time
  #-------------------------------------
  timeStep <- 5
  
  tradeData[,xKey := paste(keys_pair,strategyName,direction,sep="_")]
  allKeys <- unique(tradeData$xKey)
  
  for (thisKey in allKeys){
    
    pairStrategyTradeData <- subset(tradeData,xKey == thisKey)
    E_numberOfTrades <- nrow(pairStrategyTradeData)
    
    cat("\n",thisKey,E_numberOfTrades)
    
    #get the times we are in a trade
    for (tradeNumber in 1:E_numberOfTrades){
      thisTimes1 <- seq(from = pairStrategyTradeData[tradeNumber,enterTime],to=pairStrategyTradeData[tradeNumber,exitTime- timeStep],by=timeStep )
      thisPercent1 <- rep(pairStrategyTradeData[tradeNumber,percentPair],length(thisTimes1))
      if (tradeNumber==1){
        allTimes1 <- thisTimes1
        allPercent1 <- thisPercent1
      } else {
        allTimes1 <- c(allTimes1,thisTimes1)
        allPercent1 <- c(allPercent1,thisPercent1)
      }
    } #tradeNumber
    
    #make sure the max percent at any one time is not > 1
    d <- cbind.data.frame(allTimes1,allPercent1)
    posPercents <- aggregate(allPercent1 ~ allTimes1,data=d,sum)
    
    maxPercent <- max(posPercents$allPercent1)
    if (maxPercent > 1){
      #look up the offending time in the aggregated table, not in the raw rows
      aBadTime <- posPercents[which(posPercents$allPercent1 > 1)[1],'allTimes1']
      ermsg <- paste("\n you cannot be more than 100% in a trade at any one time:\n",thisKey,"\ntime=",aBadTime,"\npercent =",maxPercent)
      stop(ermsg)
    }
    
  } #  thisKey

  cat('\nthe file appears to be OK!')
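
For larger files, the exposure check in the loop above can also be written with data.table grouping. This is only a sketch for a single pair/strategy/direction key: the toy trades and the names tradeId, openSteps, exposure and overLimit are made up for the example, and it assumes enter and exit times are multiples of 5.

```r
library(data.table)

# Hypothetical trades for one pair/strategy/direction key
trades <- data.table(
  enterTime   = c(0, 10, 10),
  exitTime    = c(20, 30, 15),
  percentPair = c(0.5, 0.4, 0.2)
)
timeStep <- 5

# One row per trade per 5-minute step the trade is open
# (the exit step itself is excluded, as in the loop above)
trades[, tradeId := .I]
openSteps <- trades[, .(time = seq(enterTime, exitTime - timeStep, by = timeStep),
                        percentPair = percentPair),
                    by = tradeId]

# Total exposure at each time step, and the steps that exceed 100%
exposure <- openSteps[, .(totalPercent = sum(percentPair)), by = time][order(time)]
overLimit <- exposure[totalPercent > 1]

print(overLimit)
```

Here the three toy trades overlap at time 10, where the combined position is 0.5 + 0.4 + 0.2, so that step is the only one reported in overLimit.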

Update 17th Oct 2019.
The file validator is now stricter. The basic checks section in the above code has been expanded to the following set of rules:




requiredCols <- c('strategyName','keys_pair','enterTime','exitTime','direction','percentPair')
requiredPairs <- paste0('pair_',1:22)

#do some checks
if (!identical(sort(intersect(colnames(tradeData),requiredCols)) , sort(requiredCols))) {
  missing_cols <- setdiff(requiredCols,colnames(tradeData))
  ermsg <- paste('\n incorrect column names in the trades file. We are missing:\n',paste(missing_cols,collapse = ","))
} else if (class(tradeData$enterTime) == "character") {
  ermsg <- '\n the field enterTime contains non numeric values'
} else if (class(tradeData$exitTime) == "character") {
  ermsg <- '\n the field exitTime contains non numeric values'
} else if (class(tradeData$direction) == "character") {
  ermsg <- '\n the field direction contains non numeric values'
} else if (class(tradeData$percentPair) == "character") {
  ermsg <- '\n the field percentPair contains non numeric values'
} else if (nrow(tradeData) == 0){
  ermsg <- '\n the file contains no records'
} else if(length(which(is.na(tradeData)==TRUE)) > 0) {
  ermsg <- '\n the file contains missing values'
} else if (min(tradeData$exitTime - tradeData$enterTime) <= 0) {
  ermsg <- '\n the exit time needs to be after the entry time'
} else if (nrow(tradeData) < 50){
  ermsg <- '\n not enough trades'
} else if (max(tradeData$direction) > 1){
  ermsg <- '\n trade direction > 1'
} else if (min(tradeData$direction) < 0){
  ermsg <- '\n trade direction < 0'
} else if (length(unique(tradeData$direction)) > 2){
  ermsg <- '\n too many trade directions'
} else if (max(tradeData$enterTime %% 5) > 0) {
  ermsg <- '\n wrong times'
} else if (max(tradeData$exitTime %% 5) > 0) {
  ermsg <- '\n wrong times'
} else if (min(tradeData$percentPair) < 0) {
  ermsg <- '\n negative percent pairs'
} else if ( length(intersect(unique(tradeData$keys_pair),requiredPairs)) == 0 ){
  ermsg <- '\n you need to predict for all pairs'
} else if ( !identical(sort(intersect(unique(tradeData$keys_pair),requiredPairs)),sort(requiredPairs))){
  ermsg <- '\n you need to predict for all pairs'
} else if (min(tradeData[,.N,by = c('keys_pair','direction')]$N) < 5) {
  ermsg <- '\n you need at least 5 trades pairs per '
}
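
To get a feel for what the stricter rules require, here is a sketch that builds a small toy trades table and re-runs a few of the rules on it. The strategy name, the trade times, the 10% position size and the use of direction = 1 are all assumptions made for the example; the post does not say which direction value means long or short.

```r
library(data.table)

requiredPairs <- paste0('pair_', 1:22)

# Hypothetical skeleton: 5 non-overlapping trades for each of the 22 pairs
toy <- CJ(keys_pair = requiredPairs, tradeNumber = 1:5)
toy[, strategyName := 'toy_strategy_1']       # made-up name
toy[, enterTime := (tradeNumber - 1) * 10]    # 0, 10, 20, 30, 40
toy[, exitTime := enterTime + 5]              # exit one step after entry
toy[, direction := 1]                         # assumed encoding, must lie in [0, 1]
toy[, percentPair := 0.1]
toy[, tradeNumber := NULL]

# Re-run a few of the stricter rules on the toy table
checks <- c(
  enoughTrades         = nrow(toy) >= 50,
  directionInRange     = min(toy$direction) >= 0 && max(toy$direction) <= 1,
  timesOnGrid          = all(toy$enterTime %% 5 == 0) && all(toy$exitTime %% 5 == 0),
  allPairs             = setequal(unique(toy$keys_pair), requiredPairs),
  fivePerPairDirection = min(toy[, .N, by = c('keys_pair', 'direction')]$N) >= 5
)
print(checks)
```

The table has 22 x 5 = 110 rows, which clears the 50-trade minimum, and every pair has exactly 5 trades in the single direction used.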