Another Data Mining Blog: Buy Low Sell High

Tuesday, 27 August 2019

Buy Low Sell High - a simple strategy



Below is some R code to generate a valid submission file for a very simple strategy.

It is up to you to find the strategies that will give the highest returns.

In the next post we'll introduce the leaderboard where you can upload your strategies.


############################################################
#
# sample R code for the 2019 Melbourne Datathon that will 
# generate a valid submission file.
#
############################################################



library(data.table)


#----------------------------------------------
#define where the data is & load
#----------------------------------------------
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")
dt <- fread(theDataFile)


#-------------------------
# strategy settings
#-------------------------
feePerc <- 0.0015 #fixed const for the competition
longThresh <- 0.002 #the value the prediction has to be higher than
longRemainWindow <- 12 #the number of hours to remain in a trade


#---------------------------------------------------
# flag if the criteria for entry are met
#---------------------------------------------------
dt[,longEntry := 0]
dt[Lpred7b >= longThresh, longEntry := 1]

#----------------------
# set the exit times
#----------------------
exitLongs <- subset(dt,longEntry == 1,select = c('keys_pair','minutesSinceStart'))
exitLongs[,minutesSinceStart := minutesSinceStart + (longRemainWindow * 60)]
exitLongs[,longExit := 1]

#-------------------------------
#merge the entry and exit times
#-------------------------------
myKeys <-  c("keys_pair","minutesSinceStart")
setkeyv(dt,myKeys)
setkeyv(exitLongs,myKeys)
dt <- merge(dt,exitLongs,all.x=TRUE)

#-----------------------------
# no match = no exit 
#-----------------------------
dt[is.na(longExit), longExit := 0]

#--------------------------------------------------
# if a row is both an entry and an exit, cancel the exit
#--------------------------------------------------
dt[longEntry == 1 & longExit == 1, longExit := 0]

#-----------------------------------------------------
#assign a group number to consecutive runs of entries
#-----------------------------------------------------
setkeyv(dt,myKeys)
dt[,grp := rleidv(dt, cols="longEntry")] 
dt[,grp1 := paste(keys_pair,grp,sep="_")] #separator avoids clashes such as "A1"+"2" vs "A"+"12"

dt[, inc := cumsum(longEntry),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#--------------------------------------
# the first entry in the run is a BUY
#--------------------------------------
dt[,buy := 0]
dt[longEntry==1 & inc1 == 1,buy :=1 , by=grp1]

#-----------------------------------------------------
#assign a group number to consecutive runs of exits
#-----------------------------------------------------
dt[,grp := rleidv(dt, cols="longExit")] 
dt[,grp1 := paste(keys_pair,grp,sep="_")] #separator avoids clashes such as "A1"+"2" vs "A"+"12"

dt[, inc := cumsum(longExit),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]

#-------------------------------------
# The first exit in the run is a SELL
#-------------------------------------
dt[,sell := 0]
dt[longExit==1 & inc1 == 1,sell :=1 , by=grp1]


#---------------------------------------------------
# Keep just the rows where we are BUYING or SELLING
#---------------------------------------------------
d1 <- subset(dt,(sell==1 & buy==0 | sell==0 & buy==1),select = c('keys_pair','minutesSinceStart','tradePrice','buy','sell'))

#------------------------------------------
# Make sure each buy is followed by a sell
# (the first row of each pair has no previous
# row, so its lag is NA; keep it explicitly,
# otherwise subset() silently drops it)
#------------------------------------------
d1[,prevRowBuy :=  shift(buy, 1, type="lag") , by=keys_pair]
d1[,prevRowSell :=  shift(sell, 1, type="lag") , by=keys_pair]
d1 <- subset(d1,is.na(prevRowBuy) | (prevRowBuy != buy & prevRowSell != sell))

#--------------------------------------------------------------
#find the next price (for buy rows it will be the sell price)
#--------------------------------------------------------------
d1[,sellPrice := shift(tradePrice, 1, type="lead"), by=keys_pair]
d1[,minutesSinceStartExit := shift(minutesSinceStart, 1, type="lead"), by=keys_pair]
d1[, tradeLength := minutesSinceStartExit - minutesSinceStart]


#----------------------------------------------------------------
# now we only need the buy rows as we have the sell time & price
#----------------------------------------------------------------
d1 <- subset(d1,buy==1 & !is.na(minutesSinceStartExit))


#-------------------------------------------------------------------
# long profit estimate (we only have prices for some pairs though)
# this does not include a stoploss calculation
#--------------------------------------------------------------------
d1[,tradeProfit := ( (sellPrice * (1 - feePerc)) - (tradePrice * (1 + feePerc)) )/ tradePrice]

hist(d1$tradeProfit,breaks=100,col='blue');abline(v=0,col='red')


#------------------------------
# generate a submission file 
#------------------------------
stratName <- 'demo_LongOnly'
d1[,enterTime := minutesSinceStart]
d1[,exitTime := minutesSinceStartExit]
d1[,direction := 1]
d1[,percentPair := 1]
d1[,strategyName := stratName]

d1 <- subset(d1,select = c('keys_pair','enterTime','exitTime','direction','percentPair','strategyName'))

fwrite(d1,paste0(dataFolder,stratName,'.csv'))
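
As a rough check on the tradeProfit formula above, here is a small sketch on made-up prices (the pair names and prices are invented for illustration and are not from the competition data). With the fee charged on both the buy and the sell, a trade that exits at the entry price loses about 0.3%, so the price needs to rise by slightly more than 0.3% just to break even.

```r
library(data.table)

feePerc <- 0.0015  # same fee as in the script above

# made-up trades: hypothetical pairs and prices, for illustration only
toy <- data.table(
  keys_pair  = c("PAIR_A", "PAIR_B", "PAIR_C"),
  tradePrice = c(100, 100, 100),     # price paid on entry
  sellPrice  = c(100, 100.3, 101)    # price received on exit
)

# same formula as tradeProfit in the script above
toy[, tradeProfit := ((sellPrice * (1 - feePerc)) - (tradePrice * (1 + feePerc))) / tradePrice]

# ratio sellPrice / tradePrice at which the profit is exactly zero
breakEven <- (1 + feePerc) / (1 - feePerc)

print(toy)
print(breakEven)
```

In this toy table only PAIR_C, with a 1% price rise, ends up with a positive tradeProfit; the flat trade and the 0.3% rise both lose money after fees.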

24 comments:

Saurabh, 4 September 2019 at 08:13
Hi,

Would you be able to provide the sample code in Python as well?

Thanks

Sali Mali, 4 September 2019 at 19:57
Sorry, no. You are the data scientists so you should be able to do that yourself.
If anyone wants to post a Python equivalent then that may be appreciated by others.

Sam, 5 September 2019 at 07:09
So I have a problem with this type of submission:

The Lpred7b variable is on a different scale for each pair, so the threshold is unlikely to be comparable across them. This poses a problem when the new data is given in the final evaluation round: how can any trading logic be built on the provided trading pairs if the magnitude of these predictions is not transferable to new data?

If this isn't clear enough, let me demonstrate my concern with a simple example. I might have a model of the form:

Value = 0.5 * Pred1 + 0.3 * Pred2

Then use this "Value" variable to determine my trading logic. If the new pairs in the new dataset are of a different scale, then Pred1 and Pred2 for the newly provided pairs cannot be used in my pre-trained model above, as they relate to different variables that just so happen to have the same name.

Ultimately this leads to the question: will the closing prices and predicted fields of the TRAINING pairs also be included in the new data given at the end, so that my built model can actually be used?

Sali Mali, 5 September 2019 at 21:46
Hi Sam,

If the predictions appear to be on a different scale then it's probably because the thing we are predicting varies between pairs. If a pair never changed in price then our predictions would hopefully not change much either. Another pair might be more volatile and hopefully our predictions would match this. We would rather trade this latter pair.
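
To see how one fixed threshold behaves on pairs with different volatility, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the two pair names are hypothetical and the predictions are random numbers, not real Lpred7b values.

```r
library(data.table)

set.seed(1)

# made-up data: one quiet and one volatile hypothetical pair
toy <- data.table(
  keys_pair = rep(c("QUIET_PAIR", "VOLATILE_PAIR"), each = 1000),
  Lpred7b   = c(rnorm(1000, sd = 0.0005), rnorm(1000, sd = 0.005))
)

longThresh <- 0.002  # same threshold as in the sample script

# per pair: spread of the prediction and share of rows passing the threshold
summ <- toy[, .(predSd     = sd(Lpred7b),
                entryShare = mean(Lpred7b >= longThresh)),
            by = keys_pair]

print(summ)
```

In this toy setup the volatile pair passes the threshold far more often than the quiet one, so most of the entries end up on the pair that moves more.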

Hope this makes sense?

Sam, 6 September 2019 at 01:01
It does make sense, except that the model learned in the training phase has only seen the magnitudes of the pairs it was trained on. Further to that, a model might make use of the price action of Bitcoin_USD (like a 14 day moving average), but I'm not sure this can even be done: if the new data doesn't contain the closing prices of Bitcoin_USD, then my model won't work on the new data provided at the end.

There is a lot of confusion about how all of this runs, which risks making a lot of people's strategies inapplicable to the final dataset. I definitely think further clarification on what can and can't be used on the unseen data, and on what will and will not be given, would be very helpful to a lot of people. I'm surprised there aren't as many questions on this as there should be.

Khurram Siddiqui, 8 September 2019 at 16:30
That makes sense. We don't have any past decisions in the given file, so how will I teach my model which decision to take based on predictions 1-14? Thanks for the update.

Khurram Siddiqui, 6 October 2019 at 21:08
Hi Phil,

What is meant by longThresh <- 0.002 #the value the prediction has to be higher than?

Does it mean we should only work with the data where each of the 14 predictions is above 0.002?

Sali Mali, 6 October 2019 at 21:12
It's setting a parameter value, so whenever you see 'longThresh' in the rest of the code, substitute 0.002.
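
As a small illustration of that substitution, here is a sketch on made-up rows (the pair name and prediction values are invented). In the sample script only the Lpred7b column is compared against the threshold; no rows are removed from the data at that step, they are just flagged.

```r
library(data.table)

longThresh <- 0.002  # the parameter from the sample script

# made-up predictions for a hypothetical pair
toy <- data.table(
  keys_pair         = "PAIR_A",
  minutesSinceStart = c(0, 60, 120, 180),
  Lpred7b           = c(0.0005, 0.0020, 0.0031, -0.0010)
)

# as in the script: flag rows where the prediction is at least the threshold
toy[, longEntry := 0]
toy[Lpred7b >= longThresh, longEntry := 1]

# the same flag with the number written in directly
toy[, longEntryLiteral := as.numeric(Lpred7b >= 0.002)]

print(toy)
```

Both flag columns come out the same, and changing longThresh in one place changes every comparison that uses it.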