Tuesday, 27 August 2019

Buy Low Sell High - a simple strategy




Below is some R code to generate a valid submission file for a very simple strategy.

It is up to you to find the strategies that give the highest returns.

In the next post we'll introduce the leaderboard where you can upload your strategies.



############################################################
#
# sample R code for the 2019 Melbourne Datathon that will 
# generate a valid submission file.
#
############################################################



library(data.table)


#----------------------------------------------
#define where the data is & load
#----------------------------------------------
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")
dt <- fread(theDataFile)


#-------------------------
# strategy settings
#-------------------------
feePerc <- 0.0015 #fixed const for the competition
longThresh <- 0.002 #the value the prediction has to be higher than
longRemainWindow <- 12 #the number of hours to remain in a trade


#---------------------------------------------------
# flag rows where the entry criterion is met
#---------------------------------------------------
dt[,longEntry := 0]
dt[Lpred7b >= longThresh, longEntry := 1]

#----------------------
# set the exit times
#----------------------
exitLongs <- subset(dt,longEntry == 1,select = c('keys_pair','minutesSinceStart'))
exitLongs[,minutesSinceStart := minutesSinceStart + (longRemainWindow * 60)]
exitLongs[,longExit := 1]

#-------------------------------
#merge the entry and exit times
#-------------------------------
myKeys <-  c("keys_pair","minutesSinceStart")
setkeyv(dt,myKeys)
setkeyv(exitLongs,myKeys)
dt <- merge(dt,exitLongs,all.x=TRUE)

#-----------------------------
# no match = no exit 
#-----------------------------
dt[is.na(longExit), longExit := 0]

#--------------------------------------------------
# if a row is both entry and exit, drop the exit
#--------------------------------------------------
dt[longEntry == 1 & longExit == 1, longExit := 0]

#-----------------------------------------------------
#assign a group number to consecutive runs of entries
#-----------------------------------------------------
setkeyv(dt,myKeys)
dt[,grp := rleidv(dt, cols="longEntry")] 
dt[,grp1 := paste0(keys_pair,grp)]

dt[, inc := cumsum(longEntry),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#--------------------------------------
# the first entry in the run is a BUY
#--------------------------------------
dt[,buy := 0]
dt[longEntry==1 & inc1 == 1,buy :=1 , by=grp1]

#-----------------------------------------------------
#assign a group number to consecutive runs of exits
#-----------------------------------------------------
dt[,grp := rleidv(dt, cols="longExit")] 
dt[,grp1 := paste0(keys_pair,grp)]

dt[, inc := cumsum(longExit),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#-------------------------------------
# The first exit in the run is a SELL
#-------------------------------------
dt[,sell := 0]
dt[longExit==1 & inc1 == 1,sell :=1 , by=grp1]


#---------------------------------------------------
# Keep just the rows where we are BUYING or SELLING
#---------------------------------------------------
d1 <- subset(dt,(sell==1 & buy==0 | sell==0 & buy==1),select = c('keys_pair','minutesSinceStart','tradePrice','buy','sell'))

#------------------------------------------
# Make sure each buy is followed by a sell
#------------------------------------------
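d1[,prevRowBuy :=  shift(buy, 1, type="lag") , by=keys_pair]
d1[,prevRowSell :=  shift(sell, 1, type="lag") , by=keys_pair]
# keep the first row of each pair: its lagged values are NA, and subset() drops NA conditions
d1 <- subset(d1,is.na(prevRowBuy) | (prevRowBuy != buy & prevRowSell != sell))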

#--------------------------------------------------------------
#find the next price (for buy rows it will be the sell price)
#--------------------------------------------------------------
d1[,sellPrice := shift(tradePrice, 1, type="lead"), by=keys_pair]
d1[,minutesSinceStartExit := shift(minutesSinceStart, 1, type="lead"), by=keys_pair]
d1[, tradeLength := minutesSinceStartExit - minutesSinceStart]


#----------------------------------------------------------------
# now we only need the buy rows as we have the sell time & price
#----------------------------------------------------------------
d1 <- subset(d1,buy==1 & !is.na(minutesSinceStartExit))


#-------------------------------------------------------------------
# long profit estimate (we only have prices for some pairs though)
# this does not include a stoploss calculation
#--------------------------------------------------------------------
d1[,tradeProfit := ( (sellPrice * (1 - feePerc)) - (tradePrice * (1 + feePerc)) )/ tradePrice]

hist(d1$tradeProfit,breaks=100,col='blue');abline(v=0,col='red')


#------------------------------
# generate a submission file 
#------------------------------
stratName <- 'demo_LongOnly'
d1[,enterTime := minutesSinceStart]
d1[,exitTime := minutesSinceStartExit]
d1[,direction := 1]
d1[,percentPair := 1]
d1[,strategyName := stratName]

d1 <- subset(d1,select = c('keys_pair','enterTime','exitTime','direction','percentPair','strategyName'))

fwrite(d1,paste0(dataFolder,stratName,'.csv'))
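
To make two steps of the script easier to check by hand, here is a minimal sketch on made-up data. It assumes the same feePerc as above; the toy table, the pair name TOY_PAIR and the helper tradeProfitFor are hypothetical names introduced only for this illustration. It uses rleid() with seq_len(.N) as an alternative way to pick the first row of each run of entries (the script above uses a double cumsum instead), and it works the fee-adjusted profit formula through one example trade.

```r
library(data.table)

feePerc <- 0.0015  # same fixed fee as in the script above

# Hypothetical toy data: one pair, six hourly rows, entry flag already set
toy <- data.table(
  keys_pair         = "TOY_PAIR",
  minutesSinceStart = seq(0, 300, by = 60),
  longEntry         = c(0, 1, 1, 0, 1, 0)
)

# Number the consecutive runs of identical longEntry values within each pair
toy[, grp := rleid(longEntry), by = keys_pair]

# The first row of each run of entries is a BUY
toy[, buy := as.integer(longEntry == 1 & seq_len(.N) == 1), by = .(keys_pair, grp)]

print(toy)

# Fee-adjusted return of one long trade, same formula as tradeProfit above
# (tradeProfitFor is a hypothetical helper name)
tradeProfitFor <- function(buyPrice, sellPrice, fee = feePerc) {
  (sellPrice * (1 - fee) - buyPrice * (1 + fee)) / buyPrice
}

# Buy at 100, sell at 101: a 1% price rise leaves about 0.70% after fees
exampleProfit <- tradeProfitFor(100, 101)

# Smallest price rise that covers the fee on both legs (about 0.30%)
breakEvenRise <- (1 + feePerc) / (1 - feePerc) - 1

print(exampleProfit)
print(breakEvenRise)
```

On this toy table only the second and fifth rows are flagged as buys, and the break-even figure shows that a long trade needs a price rise of roughly 0.3% just to cover the fee on both legs.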

25 comments:

  1. Hi,

    Would you be able to provide the sample code in Python as well?

    Thanks

  2. Sorry, no. You are the data scientists, so you should be able to do that yourself.
    If anyone wants to post a Python equivalent, others may appreciate it.

  3. So I have a problem with this type of submission:

    The Lpred7b variable is on a different scale for each pair, so the threshold is unlikely to be comparable across them. This poses a problem when the new data is given in the final evaluation round. How can any trading logic be built on the provided trading pairs if the magnitudes of these predictions are not transferable to new data?

    If this isn't clear enough, let me demonstrate my concern with a simple example. I might have a model of the form:

    Value = 0.5 * Pred1 + 0.3 * Pred2

    Then use this "Value" variable to determine my trading logic. If the new pairs in the new dataset are of a different scale, then Pred1 and Pred2 for the newly provided pairs cannot be used in my pre-trained model above, as they relate to different variables that just so happen to have the same name.

    Ultimately this leads to the question: will the closing prices and predicted fields of the TRAINING pairs also be included in the new data given at the end, so that my built model can actually be used?

  4. Hi Sam,

    If the predictions appear to be on a different scale then it's probably because the thing we are predicting varies between pairs. If a pair never changed in price then our predictions would hopefully not change much either. Another pair might be more volatile and hopefully our predictions would match this. We would rather trade this latter pair.

    Hope this makes sense?

  5. It does make sense, except that the model learned in the training phase is fitted to the magnitudes of only the pairs it has seen. Further to that, a model might make use of the price action of Bitcoin_USD (like a 14 day moving average), but I'm not sure this can even be done: if the new data doesn't contain the closing prices of Bitcoin_USD, then my model won't work on the new data provided at the end.

    There is a lot of confusion about how all of this runs, which risks making a lot of people's strategies inapplicable to the final dataset. I definitely think further clarification on what can and can't be used on the unseen data, and what will and will not be given, would be very helpful to a lot of people. I'm surprised there aren't more questions on this.

    Replies
    1. https://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high.html


      What does a 'strategy' look like?

      A trading strategy uses only the predictions to make 'entry' and 'exit' trading decisions.

    2. we'll add some clarification in the next post so all is clear.

    3. Does this mean we just need to rely on Pred 1-14 to make trade decisions? Maybe I am lost, but how can we make trade decisions based only on these Pred 1-14 numbers?

    4. Yes, that is the challenge. Preds1-14 are predictions that have already been made for you. You need to figure out how to use those predictions to make trading decisions.

    5. Just FYI, it's not possible to make a profitable trading strategy using only those predictions. Any profitable strategy would be due to luck only, and would not generate a positive return over an infinite horizon.

    6. Agreed, Sam. We don't have past decisions either, or am I misinterpreting the data?

    7. You don't need past predictions to trade this. You are given future predictions, and you have the price action prior to those predictors. One way to frame the problem is to determine whether there is any signal in the future predictors that leads to a positive outcome given a short or long trade. The difficult part will be identifying which strategy applies to the unlabeled pairs.

    8. This comment has been removed by the author.

    9. Yes, that is the correct way to frame the problem. The issue is that there is not enough signal in the provided features to produce a profitable strategy, not even with a 0% transaction fee and even if you know the pair. It's unfortunate, because I think the idea for this Datathon is great. The data given is simply not useful for the problem; all strategies will be pure luck no matter how creative you get with the 14 numbers.

    10. We will soon find out :)

  6. That makes sense. We don't have any past decisions in the given file, so how will I teach my model what decision to take based on Pred 1-14? Thanks for the update.

    Replies
    1. That is for you to try to figure out. This is a real problem that we are hoping you will come up with an innovative solution for.

    2. This comment has been removed by the author.

  7. Hi Phil,

    What is meant by longThresh <- 0.002 #the value the prediction has to be higher than?

    Does it mean we only have to work with the data where each of the 14 predictions is above 0.002?

  8. it's setting a parameter value, so whenever in the rest of the code you see 'longThresh', substitute 0.002.
