Tuesday 27 August 2019

Buy Low, Sell High - DATA and Code



This post covers the data and starter code for the 2019 Melbourne Datathon. Please read the previous post first.


The data is a single file and can be downloaded from the following links:

https://drive.google.com/open?id=11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 MB zipped)

https://drive.google.com/open?id=1WuFFk7NkBgYJxgDx8jeohwiLwwOMzLZP (1.5 GB unzipped)


The column names should be self-explanatory. There are two price columns:

barClosePrice - this is the price at the current timestamp 'minutesSinceStart'. It is the price whose change the predictions (Lpred1b to Lpred14b) are predicting.

tradePrice - this is the price 5 minutes later, and it is the price used in the returns calculations. We assume it takes 5 minutes from the time we decide to trade to the time the trade is actually executed.
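To make the difference between the two price columns concrete, here is a minimal Python sketch on made-up numbers. It assumes, purely for illustration, that bars are one minute apart and that tradePrice is simply the barClosePrice recorded five minutes later for the same pair. The helper names (trade_price_lookup, simple_return) are hypothetical, and simple_return is not the official returns calculation for the challenge, just a plain price ratio.

```python
# Toy rows for ONE pair; all values are invented for illustration only.
# ASSUMPTION: bars are one minute apart and tradePrice equals the
# barClosePrice recorded five minutes later for the same pair.
EXECUTION_DELAY_MINUTES = 5  # from the post: 5 minutes between decision and execution


def trade_price_lookup(bars, delay=EXECUTION_DELAY_MINUTES):
    """Map each decision minute to the close price `delay` minutes later.

    Returns None for minutes where no later bar is available.
    """
    close_by_minute = {b["minutesSinceStart"]: b["barClosePrice"] for b in bars}
    return {m: close_by_minute.get(m + delay) for m in close_by_minute}


def simple_return(entry_price, exit_price):
    """Hypothetical helper: plain price ratio minus one (not the official formula)."""
    return exit_price / entry_price - 1.0


# Twelve one-minute bars with prices 100, 101, ..., 111
bars = [{"minutesSinceStart": m, "barClosePrice": 100.0 + m} for m in range(12)]
trade_prices = trade_price_lookup(bars)

# Decide to buy at minute 0 and decide to sell at minute 5.
# Both trades execute five minutes after the decision.
entry = trade_prices[0]  # close price at minute 5
exit_ = trade_prices[5]  # close price at minute 10
print(simple_return(entry, exit_))
```

The point of the sketch is that a decision made on barClosePrice at minute t is filled at the price from minute t + 5, so any backtest should use tradePrice, not barClosePrice, when working out profit and loss.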



Below is some R code to get you going, and we'll be posting more in the coming days.


#--------------------------------------------------------------------------------------
# some R code to get you started with the 2019 Melbourne Datathon analytic challenge
# download the data from the link below and unzip to 'dataFolder'
# https://drive.google.com/file/d/11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 mb)
#--------------------------------------------------------------------------------------

library(data.table)

#define where the data is
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")

#read in the data
dt <- fread(theDataFile)

nrow(dt)
#4,907,361

colnames(dt)
# [1] "keys_pair"         "minutesSinceStart" "gap"               "barClosePrice"     "tradePrice"        "Lpred1b"           "Lpred2b"           "Lpred3b"           "Lpred4b"          
#[10] "Lpred5b"           "Lpred6b"           "Lpred7b"           "Lpred8b"           "Lpred9b"           "Lpred10b"          "Lpred11b"          "Lpred12b"          "Lpred13b"         
#[19] "Lpred14b" 


unique(dt$keys_pair)
#[1] "0x_bitcoin"           "bitcoin_usdollar"     "bitcoincash_usdollar" "cardano_bitcoin"      "dash_usdollar"        "litecoin_bitcoin"     "litecoin_tetherusd"   "monero_bitcoin"      
#[9] "qtum_bitcoin"         "ripple_bitcoin"       "ripple_usdollar"      "stratis_bitcoin"      "tron_tetherusd"       "zcash_bitcoin"        "pair_1"               "pair_2"              
#[17] "pair_3"               "pair_4"               "pair_5"               "pair_6"               "pair_7"               "pair_8"               "pair_9"               "pair_10"             
#[25] "pair_11"              "pair_12"              "pair_13"              "pair_14"              "pair_15"              "pair_16"              "pair_17"              "pair_18"             
#[33] "pair_19"              "pair_20"              "pair_21"              "pair_22"       


#price information is missing for the unnamed pairs
x1 <- subset(dt,keys_pair== "pair_1")
nrow(x1)
#132,721
summary(x1)


#distribution of the predictions
hist(dt$Lpred1b,breaks=100)

#time series of prices and predictions
x <- subset(dt,keys_pair== "bitcoin_usdollar")[1:1000]
plot(x$minutesSinceStart,x$barClosePrice,type='l',col="blue")
plot(x$minutesSinceStart,x$Lpred1b,type='l',col="blue")


4 comments:

  1. Says above that data's coming on Wed 27th August - the next one of which is in 2025 :-)

  2. Below is some Python code to get started (similar to the R code given in the post above)

    # anaconda python 3.6 and above

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    %matplotlib inline

    # Assuming the file is stored in a 'data' folder one level up: '../data/melbdatathon2019_buylowsellhigh.csv'
    # Forward slashes avoid backslash-escape problems and also work on Windows
    datafile = "../data/melbdatathon2019_buylowsellhigh.csv"

    #load data file
    df = pd.read_csv(datafile)

    #check the shape of the data frame
    df.shape

    # list column names
    df.columns

    # list unique key_pairs
    print(f"unique key_pairs = {df['keys_pair'].unique().shape[0]}")
    df['keys_pair'].unique()

    #price information is missing for the unnamed pairs
    filter_x1 = df.keys_pair == 'pair_1'
    x1 = df[filter_x1]
    x1.shape
    # (132721, 19)

    # summarise dataframe
    x1.describe().transpose()

    #distribution of the predictions
    df.Lpred1b.hist(bins=100, figsize=(10, 6))

    #time series of prices and predictions
    filter_x2 = df.keys_pair == 'bitcoin_usdollar'
    x = df[filter_x2].head(1000)

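    # draw each series on its own figure; otherwise both lines share one axis
    plt.figure()
    sns.lineplot(x='minutesSinceStart', y='barClosePrice', data=x)
    plt.figure()
    sns.lineplot(x='minutesSinceStart', y='Lpred1b', data=x)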

    Replies
    1. It is better to run this in a Python notebook. You can download one from my Git repo:
      https://github.com/dheepdatascigit/dsmdatathon2019help/blob/master/buylowsellhigh_startup01.ipynb
