Simplified Intro to Basic ML/AI in Stock Prediction Bots
With the current climate of the stock market relative to the broader economy, more people are looking at investing in stocks. Whether this is driven by a genuine view that values will rise or by the possibility of the Fed buying stocks and ETFs, the result is the same: the gains may be artificial, but the interest in the markets is not. A good place to start is to look at the markets and see what can be done with some of the most modern data science techniques. Beginning with a few simple ML/AI methodologies, and how they might fit into the bigger picture, is a great first step.
Here is the Repo for the Colab if you want to review this yourself.
For our purposes we will start with the most widely available data, since we can’t do any ML/AI without a good amount of it. Free daily stock price data is plentiful, and this type of data is best suited to what is called swing trading. Intraday data would be better for day trading, since it gives a faster real-world feedback loop, but that data is typically only available through paid APIs.
Now that we know what data we will use and have an idea of its best use, we need to run it through a typical ETL pipeline to standardize and normalize it. This can be fairly well automated in a pipeline for single-feature models whose data covers a reasonably small range.
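As a rough illustration of what that normalization step can look like (this is just a sketch with made-up numbers, not part of the pipeline we build below), pandas makes both standardization and min-max scaling one-liners:
import pandas as pd

# Hypothetical price series, purely for illustration
adj = pd.Series([100.0, 102.5, 101.0, 105.0, 110.0])

standardized = (adj - adj.mean()) / adj.std()              # zero mean, unit variance
normalized = (adj - adj.min()) / (adj.max() - adj.min())   # rescaled into the 0-1 range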
We will use the Apple data to make things easy. First let’s look at the libraries we will use and then go step by step into some simplified models and stock trading APIs.
!pip install pandas
!pip install scikit-learn
!pip install matplotlib
!pip install alpaca-trade-api
Pandas is a data science library for data manipulation and analysis. It is the most commonly used one, so you should get familiar with it.
scikit-learn is a machine learning library for Python. The user guide has some good examples here. It supports some of the simpler models such as linear or logistic regression.
Matplotlib is a Python library used for visualizations such as graphs, and for data analysis through visuals.
Next we look at the modules we are using within those packages, as some of these are named functions used later.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, KFold, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
Here you can see we import a number of named functions. Remember them, as we will be using them later, and as we use each one I will explain its function.
AAPL = pd.read_csv("AAPL.csv").drop(['Close'], axis=1)
AAPL.set_index('Date', drop=True, inplace=True)
First let’s load the data into pandas, and in doing so drop the “Close” column so we do not mix it up with the “Adj Close” column, which we will use for the predictions.
Next, in most cases, you will want to view some of the data just to get an idea of any format issues or missing values.
AAPL.head()
This data should be fairly clean. The things to check for here are missing values, large variances, and, for multi-feature scenarios, how to handle scaling. For our purposes we are only using “Adj Close”, but if we used multiple features, the vast difference in magnitude between “Volume” and “Adj Close” would make scaling more of a task.
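If you want to run those checks yourself, the standard pandas DataFrame methods below will surface missing values and the spread of each column (nothing here is specific to this dataset):
# Quick data quality checks on the loaded DataFrame
print(AAPL.isna().sum())   # count of missing values per column
print(AAPL.describe())     # mean, std, min and max, useful for spotting large variances between features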
Now that we have the data loaded, we need to split it into training and test sets in order to train our models.
X = AAPL['Adj Close'].shift(10).dropna()
y = AAPL['Adj Close']
trainX, testX, trainY, testY = train_test_split(X.values, y[10:].values, test_size=.2)
trainX, trainY, testX, testY = trainX.reshape(-1, 1), trainY.reshape(-1, 1), testX.reshape(-1, 1), testY.reshape(-1, 1)
We shift the data so that each input price lines up with a price further along in the series, and then split the data into our train and test bins. The models only accept data in certain shapes, so we must reshape it before putting it through training. You can see we used the train_test_split named function from the scikit-learn library mentioned earlier.
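To make the shift a little more concrete, here is a toy example with made-up values (using shift(2) instead of shift(10) so the effect fits in a few rows); the shifted series lines each value up with the value a fixed number of rows later, and the leading rows become NaN, which is why we call dropna() and start y at index 10 above:
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(s.shift(2))
# 0    NaN
# 1    NaN
# 2    1.0
# 3    2.0
# 4    3.0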
We have the data ready to train the model and now we will define the pipeline to do the scaling and train the model.
# Simple linear regression model definition along with transformation pipeline
lr = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])
Before we train, we want to get an idea of the possible accuracy, so we have a general sense of how accurate we can get. We will use k-fold cross-validation for this.
kfold = KFold(n_splits=10)
lr_score = cross_val_score(lr, trainX, trainY, cv=kfold, scoring='neg_mean_squared_error')
lr_score.mean()
-44.11564926863038
Not great accuracy, but with one feature and one time step back predicting one time step forward, I was not expecting great results. The goal here is more to teach and learn, so this should do fine.
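One note that may help with reading that number: scoring='neg_mean_squared_error' returns a negated mean squared error, so flipping the sign and taking the square root gives a rough error figure in dollars. A quick sketch using the lr_score from the cell above:
import numpy as np

# Convert the negated MSE from cross-validation into an approximate dollar error (RMSE)
rmse = np.sqrt(-lr_score.mean())
print("Approximate cross-validated error in dollars:", rmse)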
Let us get into the exciting part, training the model.
# Review predictions against the remaining data
lr.fit(trainX, trainY)
preds = lr.predict(testX)
Once that has completed let’s see how it worked.
# Plot data for review
plt.plot([a for a in range(len(testY))], testY, 'r-', label='actual')
plt.plot([a for a in range(len(preds))], preds, 'g--', label='model')
plt.legend()
plt.show()
Now, when training the model you can easily output the accuracy as the model sees it. However, in my opinion this is typically a very biased view, and it is best to test on a hold-out dataset that is totally new so you do not become over-confident because of over-fitting.
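Before loading the separate backtest file, you can also put a number on the test-split predictions we just plotted; mean_absolute_error from scikit-learn is one standard way to do that (a small sketch using the preds computed above):
from sklearn.metrics import mean_absolute_error

# Average absolute error, in dollars, on the held-out test split
print("Test MAE in dollars:", mean_absolute_error(testY, preds))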
backtest = pd.read_csv("AAPL_Backtest.csv")
backtest.head()
Let’s cycle through the backtest data to get the average error per prediction and the largest price seen.
errors = []
counter = 0
pred = None
maxprice = 0
avgprice = 0

# Get the average amount off per prediction
# This is our average error per call
# Max and Min would also be important here
for price in backtest['Adj Close'].values[::1000]:
    absprice = abs(price)
    if absprice > maxprice:
        maxprice = price
    if counter == 0:
        pred = lr.predict([[price]])
    else:
        errors.append(pred - price)
        pred = lr.predict([[price]])
    counter += 1

# Over short periods of time such as 10 values the error deviation might seem erroneously low
print("Avg error in dollars: ", (sum(errors) / len(errors)))
print("Max difference: ", maxprice)
Avg error in dollars:  [[-43.47646708]]
Max difference:  66.751823
So we have a model trained, we know how to use it, and we have an idea of its accuracy. We were not expecting great accuracy, but with the newer data priced in roughly the $200 range, that error works out to an estimated 30% variance, which is not great. Many other factors and features would need to be included to get a more realistic and accurate prediction.
Typically from here you would use the model to start making predictions and then placing trades based on those predictions. For our purposes you will need to come up with your own trading strategy, but we cover the basics of the integration in the repo and Google Colab.
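As a rough, hedged sketch of what that integration can look like (the key ID, secret key, and the decision logic are placeholders, and this points at Alpaca’s paper-trading endpoint rather than a live account), placing a market order through alpaca-trade-api generally looks like this:
import alpaca_trade_api as tradeapi

# Placeholder credentials; use your own paper-trading keys from the Alpaca dashboard
api = tradeapi.REST("YOUR_KEY_ID", "YOUR_SECRET_KEY", base_url="https://paper-api.alpaca.markets")

# Example only: buy one share of Apple at market price when your strategy signals a buy
api.submit_order(symbol="AAPL", qty=1, side="buy", type="market", time_in_force="day")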
Hopefully that was a good intro. In the next sections we will look at more complicated models and see how good the accuracy can get.