Another Data Mining Blog: Buy Low, Sell High

This post covers the data for the 2019 Melbourne Datathon. Please read the previous post first.

The data is a single file and can be downloaded from the following links:

https://drive.google.com/open?id=11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 MB zipped)

https://drive.google.com/open?id=1WuFFk7NkBgYJxgDx8jeohwiLwwOMzLZP (1.5 GB unzipped)

The column names should be self-explanatory. There are two price columns:

barClosePrice - this is the price at the current timestamp 'minutesSinceStart'. It is the price whose change the predictions (Lpred1b-Lpred14b) are predicting.

tradePrice - this is the price 5 minutes later. It is the price that will be used in the returns calculations, because we assume it takes 5 minutes from deciding to trade to actually executing the trade.
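
To make the two price columns concrete, here is a minimal pandas sketch on made-up data. It assumes one row per minute for a single pair, and that a position is opened at one decision time and closed at a later one, with both legs filled at tradePrice. The helper name trade_return, the toy prices and the 3-minute holding period are illustrative assumptions, not part of the Datathon rules.

```python
import pandas as pd

# Toy data for a single pair: one row per minute (assumed layout).
bar_close = [100.0, 101.0, 102.0, 103.0, 104.0,
             105.0, 106.0, 107.0, 108.0, 109.0]
toy = pd.DataFrame({
    "minutesSinceStart": range(10),
    "barClosePrice": bar_close,
})

# tradePrice is described as the price 5 minutes later, so on evenly
# spaced one-minute bars it is barClosePrice shifted back by 5 rows.
# The last 5 rows have no later price and become NaN.
toy["tradePrice"] = toy["barClosePrice"].shift(-5)


def trade_return(frame, decide_buy_at, decide_sell_at):
    """Simple return of a buy then a sell, both filled at tradePrice.

    Hypothetical helper: the arguments are the minutesSinceStart values
    at which the decisions are made, not the times the trades execute.
    """
    prices = frame.set_index("minutesSinceStart")["tradePrice"]
    buy_fill = prices.loc[decide_buy_at]
    sell_fill = prices.loc[decide_sell_at]
    return sell_fill / buy_fill - 1.0


# Decide to buy at minute 0 and decide to sell at minute 3.
r = trade_return(toy, 0, 3)
print(r)
```

The point of the sketch is that the return is driven by the tradePrice at each decision time, not by the barClosePrice seen when the decision was made.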

Below is some R code to get you going, and we'll be posting more in the coming days.

#--------------------------------------------------------------------------------------
# some R code to get you started with the 2019 Melbourne Datathon analytic challenge
# download the data from the link below and unzip to 'dataFolder'
# https://drive.google.com/file/d/11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 mb)
#--------------------------------------------------------------------------------------

library(data.table)

#define where the data is
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")

#read in the data
dt <- fread(theDataFile)

nrow(dt)
#4,907,361

colnames(dt)
# [1] "keys_pair"         "minutesSinceStart" "gap"               "barClosePrice"     "tradePrice"        "Lpred1b"           "Lpred2b"           "Lpred3b"           "Lpred4b"          
#[10] "Lpred5b"           "Lpred6b"           "Lpred7b"           "Lpred8b"           "Lpred9b"           "Lpred10b"          "Lpred11b"          "Lpred12b"          "Lpred13b"         
#[19] "Lpred14b" 


unique(dt$keys_pair)
#[1] "0x_bitcoin"           "bitcoin_usdollar"     "bitcoincash_usdollar" "cardano_bitcoin"      "dash_usdollar"        "litecoin_bitcoin"     "litecoin_tetherusd"   "monero_bitcoin"      
#[9] "qtum_bitcoin"         "ripple_bitcoin"       "ripple_usdollar"      "stratis_bitcoin"      "tron_tetherusd"       "zcash_bitcoin"        "pair_1"               "pair_2"              
#[17] "pair_3"               "pair_4"               "pair_5"               "pair_6"               "pair_7"               "pair_8"               "pair_9"               "pair_10"             
#[25] "pair_11"              "pair_12"              "pair_13"              "pair_14"              "pair_15"              "pair_16"              "pair_17"              "pair_18"             
#[33] "pair_19"              "pair_20"              "pair_21"              "pair_22"       


#price information is missing for the unnamed pairs
x1 <- subset(dt,keys_pair== "pair_1")
nrow(x1)
#132,721
summary(x1)


#distribution of the predictions
hist(dt$Lpred1b,breaks=100)

#time series of prices and predictions
x <- subset(dt,keys_pair== "bitcoin_usdollar")[1:1000]
plot(x$minutesSinceStart,x$barClosePrice,type='l',col="blue")
plot(x$minutesSinceStart,x$Lpred1b,type='l',col="blue")
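
The pair_1 check above can be run for every pair at once. The following is a minimal pandas sketch on made-up rows. It assumes missing prices are loaded as NaN (the post does not say how they are encoded); the toy values and the names missing_share and pairs_without_prices are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows: one named pair with prices, one unnamed pair without.
toy = pd.DataFrame({
    "keys_pair": ["bitcoin_usdollar"] * 3 + ["pair_1"] * 3,
    "barClosePrice": [10000.0, 10010.0, 10005.0, np.nan, np.nan, np.nan],
    "tradePrice": [10010.0, 10005.0, 10020.0, np.nan, np.nan, np.nan],
    "Lpred1b": [0.001, -0.002, 0.0005, 0.003, -0.001, 0.002],
})

# Share of missing values per pair for each price column
# (the mean of a boolean column is the fraction of True values).
price_cols = ["barClosePrice", "tradePrice"]
missing_share = toy[price_cols].isna().groupby(toy["keys_pair"]).mean()

# Pairs where every price in both columns is missing.
all_missing = (missing_share == 1.0).all(axis=1)
pairs_without_prices = missing_share.index[all_missing].tolist()

print(missing_share)
print(pairs_without_prices)
```

On the real file the same two lines of grouping logic would be applied to the full data frame instead of the toy one.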

4 comments:

BM, 27 August 2019 at 03:28
Says above that data's coming on Wed 27th August - the next one of which is in 2025 :-)
dheep, 15 October 2019 at 01:18
Below is Python code to get started (similar to the R code in the post above)

# anaconda python 3.6 and above

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# Assuming the file is stored in a 'data' folder one level above the working directory
# (forward slashes avoid backslash-escape problems and also work on Windows)
datafile = "../data/melbdatathon2019_buylowsellhigh.csv"

#load data file
df = pd.read_csv(datafile)

#check the shape of the data frame
df.shape

# list column names
df.columns

# list unique keys_pair values
print(f"unique keys_pair values = {df['keys_pair'].unique().shape[0]}")
df['keys_pair'].unique()

#price information is missing for the unnamed pairs
filter_x1 = df.keys_pair == 'pair_1'
x1 = df[filter_x1]
x1.shape
# (132721, 19)

# summarise dataframe
x1.describe().transpose()

#distribution of the predictions
df.Lpred1b.hist(bins=100, figsize=(10, 6))

#time series of prices and predictions
filter_x2 = df.keys_pair == 'bitcoin_usdollar'
x = df[filter_x2].head(1000)

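# draw each series on its own figure, since prices and predictions are on different scales
plt.figure()
sns.lineplot(x='minutesSinceStart', y='barClosePrice', data=x)
plt.figure()
sns.lineplot(x='minutesSinceStart', y='Lpred1b', data=x)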

Tuesday, 27 August 2019

Buy Low, Sell High - DATA and Code