Another Data Mining Blog: August 2019

Tuesday, 27 August 2019

Buy Low Sell High - a simple strategy

2019

<< previous post (getting the data) next>>

Below is some R code to generate a valid submission file for a very simple strategy.

Is up to you to find the strategies that will give the highest returns.

In the next post we'll introduce the leaderboard where you can upload your strategies to.


############################################################
#
# sample R code for the 2019 Melbourne Datathon that will 
# generate a valid submission file.
#
############################################################



library(data.table)


#----------------------------------------------
#define where the data is & load
#----------------------------------------------
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")
dt <- fread(theDataFile)


#-------------------------
# strategy settings
#-------------------------
feePerc <- 0.0015 #fixed const for the competition
longThresh <- 0.002 #the value the prediction has to be higher than
longRemainWindow <- 12 #the number of hours to remain in a trade


#---------------------------------------------------
# flag if the criteria for entry is passed
#---------------------------------------------------
dt[,longEntry := 0]
dt[Lpred7b >= longThresh, longEntry := 1]

#----------------------
# set the exit times
#----------------------
exitLongs <- subset(dt,longEntry == 1,select = c('keys_pair','minutesSinceStart'))
exitLongs[,minutesSinceStart := minutesSinceStart + (longRemainWindow * 60)]
exitLongs[,longExit := 1]

#-------------------------------
#merge the entry and exit times
#-------------------------------
myKeys <-  c("keys_pair","minutesSinceStart")
setkeyv(dt,myKeys)
setkeyv(exitLongs,myKeys)
dt <- merge(dt,exitLongs,all.x=TRUE)

#-----------------------------
# no match = no exit 
#-----------------------------
dt[is.na(longExit), longExit := 0]

#--------------------------------------------------
# if there is an exit and enter, change the exit
#--------------------------------------------------
dt[longEntry == 1 & longExit == 1, longExit := 0]

#-----------------------------------------------------
#assign a group number to consecutive runs of entries
#-----------------------------------------------------
setkeyv(dt,myKeys)
dt[,grp := rleidv(dt, cols="longEntry")] 
dt[,grp1 := paste0(keys_pair,grp)]

dt[, inc := cumsum(longEntry),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#--------------------------------------
# the first entry in the run is a BUY
#--------------------------------------
dt[,buy := 0]
dt[longEntry==1 & inc1 == 1,buy :=1 , by=grp1]

#-----------------------------------------------------
#assign a group number to consecutive runs of exits
#-----------------------------------------------------
dt[,grp := rleidv(dt, cols="longExit")] 
dt[,grp1 := paste0(keys_pair,grp)]

dt[, inc := cumsum(longExit),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#-------------------------------------
# The first exit in the run is a SELL
#-------------------------------------
dt[,sell := 0]
dt[longExit==1 & inc1 == 1,sell :=1 , by=grp1]


#---------------------------------------------------
# Keep just the rows where we are BUYING or SELLING
#---------------------------------------------------
d1 <- subset(dt,(sell==1 & buy==0 | sell==0 & buy==1),select = c('keys_pair','minutesSinceStart','tradePrice','buy','sell'))

#------------------------------------------
# Make sure each buy is followed by a sell
#------------------------------------------
d1[,prevRowBuy :=  shift(buy, 1, type="lag") , by=keys_pair]
d1[,prevRowSell :=  shift(sell, 1, type="lag") , by=keys_pair]
d1 <- subset(d1,prevRowBuy != buy & prevRowSell != sell)

#--------------------------------------------------------------
#find the next price (for buy rows it will be the sell price)
#--------------------------------------------------------------
d1[,sellPrice := shift(tradePrice, 1, type="lead"), by=keys_pair]
d1[,minutesSinceStartExit := shift(minutesSinceStart, 1, type="lead"), by=keys_pair]
d1[, tradeLength := minutesSinceStartExit - minutesSinceStart]


#----------------------------------------------------------------
# now we only need the buy rows as we have the sell time & price
#----------------------------------------------------------------
d1 <- subset(d1,buy==1 & !is.na(minutesSinceStartExit))


#-------------------------------------------------------------------
# long profit estimate (we only have prices for some pairs though)
# this does not include a stoploss calculation
#--------------------------------------------------------------------
d1[,tradeProfit := ( (sellPrice * (1 - feePerc)) - (tradePrice * (1 + feePerc)) )/ tradePrice]

hist(d1$tradeProfit,breaks=100,col='blue');abline(v=0,col='red')


#------------------------------
# generate a submission file 
#------------------------------
stratName <- 'demo_LongOnly'
d1[,enterTime := minutesSinceStart]
d1[,exitTime := minutesSinceStartExit]
d1[,direction := 1]
d1[,percentPair := 1]
d1[,strategyName := stratName]

d1 <- subset(d1,select = c('keys_pair','enterTime','exitTime','direction','percentPair','strategyName'))

fwrite(d1,paste0(dataFolder,stratName,'.csv'))

Buy Low, Sell High - DATA and Code

2019

<<previous post next post>>

This is information about the 2019 Melbourne Datathon - Please read the previous post first

The data is a single file and can be downloaded from the following links:

https://drive.google.com/open?id=11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553mb zipped)

https://drive.google.com/open?id=1WuFFk7NkBgYJxgDx8jeohwiLwwOMzLZP (1.5gb unzipped)

The column names should be self explanatory. There are 2 price columns;

barClosePrice - this is the price at the current time stamp 'minutesSinceStart'. It is the price which the predictions (LPred1b-LPred14b) are predicting the change of.

tradePrice - this is the price 5 minutes later. It is the price that will be used in the returns calculations. The reason for this is that we are assuming it takes 5 minutes from the time we decide to trade to actually executing the trade.

Below us some R code to get you going, and we'll be posting more in the coming days.

#--------------------------------------------------------------------------------------
# some R code to get you started with the 2019 Melbourne Datathon analytic challenge
# download the data from the link below and unzip to 'dataFolder'
# https://drive.google.com/file/d/11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 mb)
#--------------------------------------------------------------------------------------

library(data.table)

#define where the data is
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")

#read in the data
dt <- fread(theDataFile)

nrow(dt)
#4,907,361

colnames(dt)
# [1] "keys_pair"         "minutesSinceStart" "gap"               "barClosePrice"     "tradePrice"        "Lpred1b"           "Lpred2b"           "Lpred3b"           "Lpred4b"          
#[10] "Lpred5b"           "Lpred6b"           "Lpred7b"           "Lpred8b"           "Lpred9b"           "Lpred10b"          "Lpred11b"          "Lpred12b"          "Lpred13b"         
#[19] "Lpred14b" 


unique(dt$keys_pair)
#[1] "0x_bitcoin"           "bitcoin_usdollar"     "bitcoincash_usdollar" "cardano_bitcoin"      "dash_usdollar"        "litecoin_bitcoin"     "litecoin_tetherusd"   "monero_bitcoin"      
#[9] "qtum_bitcoin"         "ripple_bitcoin"       "ripple_usdollar"      "stratis_bitcoin"      "tron_tetherusd"       "zcash_bitcoin"        "pair_1"               "pair_2"              
#[17] "pair_3"               "pair_4"               "pair_5"               "pair_6"               "pair_7"               "pair_8"               "pair_9"               "pair_10"             
#[25] "pair_11"              "pair_12"              "pair_13"              "pair_14"              "pair_15"              "pair_16"              "pair_17"              "pair_18"             
#[33] "pair_19"              "pair_20"              "pair_21"              "pair_22"       


#price information is missing for the unnamed pairs
x1 <- subset(dt,keys_pair== "pair_1")
nrow(x1)
#132,721
summary(x1)


#distribution of the predictions
hist(dt$Lpred1b,breaks=100)

#time series of prices and predictions
x <- subset(dt,keys_pair== "bitcoin_usdollar")[1:1000]
plot(x$minutesSinceStart,x$barClosePrice,type='l',col="blue")
plot(x$minutesSinceStart,x$Lpred1b,type='l',col="blue")

'Buy Low Sell High' 2019 Melbourne Datathon - INTRODUCTION

"Buy Low - Sell High"

2019

next post >>

The analytics challenge for the 2019 Melbourne Datathon

Scenario

You are a trader for a pension fund that makes money for it's investors by constantly buying and selling financial instruments such as shares in companies, foreign currencies or futures contracts. The general idea is that you aim to buy when the price is low and sell when it is high, hence making a profit.

Traditionally you'd stare at the live prices on a screen and through your experience know exactly when to trade. Recently your company has employed a team of machine learning experts to predict if prices are going up or down so as to hopefully remove the guess work form your trading decisions.

Every 5 minutes you are now receiving 14 different predictions from mathematical models. What they forgot to tell you is how to actually use those predictions.

The challenge is to figure out how to use these predictions to determine trading decisions. The winner of the challenge will be the 'trader' who's 'strategy' makes the most money.

The Data

You have been provided with over 12 months of actual prices and predictions of future price movements at 5 minute intervals for some crypto-currency pairs. This is the data you will use to develop your 'trading strategy'.

Additionally you have also been provided with just the predictions (no prices) for some further currency pairs over the same time period. This is the data on which you will apply your strategy to come up with some 'buy' and 'sell' times that determine a 'trade'.

These times will be submitted and the returns calculated.

What does a 'strategy' look like?

A trading strategy uses only the predictions to make 'entry' and 'exit' trading decisions. It can be very simple logic or sophisticated logic - it is the challenge to come up with that logic.

e.g
If prediction4 > 0.4 & prediction7 > 0.2 then 'Enter Long' (buy)
If prediction4 < 0.3 & prediction7 < 0.2 then 'Exit Long' (sell)

Code for an example strategy will be supplied

What does a Submission look like?

You will be submitting a file with 6 columns. Each row of the file will be a trade that will have it's return calculated.

keys_pair: the pair to trade
enterTime: the time to enter the trade
exitTime: the time to exit the trade
direction: are we going long (1) or short (0)
percentPair : what percent of the 'pair balance' do we want to put on this trade
strategyName: what are you calling this strategy (for reference)

The times we supply are just integers which represent 'minutes from start', hence no need to mess around with time stamps.

Direction means are we predicting the price will go up ('going long') or down ('short selling').

percentPair is the percent of the pair 'balance' that we want to put on each trade. There can be multiple trades for a single pair that overlap in time, but the sum of percentPair for that time stamp cannot be more than 1 (for each direction - see below*).

Long and Short?

It's possible to still make money if we think the price is going down, which is called 'short selling'. This is how it's done,

Say X are 10 shares in IBM and we think the share price is dropping.

Rather than buy X, we borrow X from a stockbroker and immediately sell it on the stock exchange. When the price drops we buy X back from the exchange and return X to the stockbroker. We have made a profit as we sold X for more than we bought it back. We would pay the stockbroker 'interest' out of our profit.

'going long' is simpler as we expect the price to go up, so the stockbroker does not need to be involved. We simple 'buy low and sell high' directly from the stock exchange.

*For this exercise, Long and Short trade directions are evaluated separately, which means you will get a total return for long trades and a total return for short trades. Each submitted trades file can contain just longs, just shorts, or both longs and shorts.

How is the Strategy Evaluated?

The general idea of how the strategies are evaluated are it is just a sum of the individual percentage gains of each trade. There is no re-investment of any profit.

for each trade we determine the enter (EN) and exit (EX) prices
the enter and exit prices are adjusted for the trading fee percentage (F), and then the trade percentage gain (TPG) calculated as a percentage of the entry price

long: ( (EX*(1-F)) - (EN*(1+F)) )/ EN

short: ( (EN*(1-F)) - (EX*(1+F)) )/ EN
The percent weighting is applied
TPG = TPG * percentPair
TPG is capped
It is determined if a stoploss has occurred
TPG then summed over all trades to give a total return

Fees?

There is a cost associated with each trade made, called the fee. This means that we don't quite get the actual price price as the exchange takes their cut. When we want to buy, we pay slightly more, and when we want to sell we get slightly less. The fee is applied to the price on entering and exiting a trade.

Hence when we make a trade, we need to be confident the returns will cover the fees we are charged. For this exercise, the fee F = 0.15%.

Capping?

We need to have confidence that the return of each strategy is not influenced by one or two 'lucky' large trades. To do this we take all raw TPGs and calculate the mean (M) and standard deviation (SD) of these individual gains. Any trade that has a return greater than M + (2*SD) or less than M - (2*SD) is capped at that value. This has the effect of removing extreme values.

Stop Losses?

If the price goes too far in the wrong direction then a 'stop loss' will be executed for that trade. For example, if we are 'going long' (hoping the price will go up) but the price falls by more than X% before the trade exit time is reached, then the trade will be immediately exited for you. This trade will use the price at the time the stop loss was triggered to calculate the trade returns.

For this exercise the stop loss value is 5%

Are there any other trade restrictions?

trade length

A 'dumb' strategy would just be to do a single trade for each currency pair that lasts the whole of the time period and guess if you think think it should be a long (price going up) or a short (price going down). Such a strategy might end up making a profit but it would be just guessing and not using the predictions. To prevent this, any trades that are longer than 4 days (60 x 24 x 4 minutes) in length will be eliminated.

number of trades

It's also possible if you think the predictions are no good to just submit zero trades and have a return of 0, hoping everyone else's strategy loses money. Hence all currency pairs must have at least 5 trades each.

How do we Submit our Trades?

Once you have created a trades file, you will upload it via our web portal. Each team can only upload 1 file every 6 hours, so be prudent. Each file can only contain 1 long strategy and 1 short strategy. Make sure your file names for each submission are different and that each strategy has a different name.

Several times a day we will calculate the returns and display them along with other metrics on the leader board.

You must submit trades for all the pairs in the data set. These will be divided into 3 groups and returns calculated for each group.

Train group - these are the currency pairs that have real names and prices. Use this set to make sure your calculations of the returns match ours.
Public group - this will be a subset of the remaining currency pairs (which have masked names and no prices). You will be able to see the returns for these pairs.
Private group - the remaining currency pairs. You won't be able to see the returns until the competition is over.

Who Wins?

In order to win there are several criteria:

The top 3 teams whose combined best long and best short returns on the private group exceed those of the benchmark solution will be qualify for the next stage.

We will then release a further data set. We will need to witness your code crunching this new data to generate the predictions file.

Note: we don't need the code itself or the logic - that is your IP. We just need to see stuff happening to make sure there has been no dodgy business occurring!

It is the returns on this set that will determine the grand champion, so long as they still beat the benchmark solution

What do we Win?

There will be a cash prize for the top 3 teams beating the benchmark solution. Any team or individual may be asked to come and develop their strategy for a real trading company, so there is a good chance that this could lead to a job and wealth directly correlated to the performance of your solution!

Skills, Team size etc...

In order to do well in this challenge, you must have a couple of skills;

the ability to think logically to come up with ideas
the ability to write code to test those ideas

It will not be advantageous to form teams from the start. The benefit of team forming will come later on when you want to combine ideas.

Can you tell us more about the predictions?

There are 14 predictions for each currency pair for each time stamp. The predictions are weakly correlated to the actual percentage change in price at some point in the future. Prediction1 is the shortest term prediction (1 hour ahead) and prediction14 is a longest term prediction (96 hours ahead), with the remaining being sequential in their prediction horizon.

The predictions are simulated but do represent what can be achieved. We cannot divulge any further information about the predictions.

Where is the Data?

See our next post for information on the data and some sample code.