src package

Submodules

src.BayesReg module

Code for the Gaussian Process Model implementation.

@author: Shashank Swaminathan

class src.BayesReg.GPM(ticker, start_date=datetime.datetime(2000, 1, 1, 0, 0), end_date=datetime.datetime(2019, 12, 13, 12, 36, 5, 34810))[source]

Bases: object

Class encapsulating Gaussian Process model implementation. Depends on the PyMC3 library.

__init__(ticker, start_date=datetime.datetime(2000, 1, 1, 0, 0), end_date=datetime.datetime(2019, 12, 13, 12, 36, 5, 34810))[source]

Constructor. Fetches and prepares data for the desired ticker using the Company class from the accompanying get_data module. By default, the date range runs from January 1st, 2000 to today.

Parameters
  • ticker – Ticker of stock to predict.

  • start_date – Datetime object of when to start retrieving stock data. Defaults to 1/1/2000.

  • end_date – Datetime object of when to stop retrieving stock data. Defaults to today’s date.
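
The rendered signature above shows a fixed timestamp as the end_date default, which suggests the default was computed once, when the module was imported, rather than on each call. The following is a minimal sketch of that general Python behavior and of the common None-sentinel alternative; the function names are hypothetical and are not part of this package.

```python
import datetime


def frozen_default(end_date=datetime.datetime.now()):
    # The default expression is evaluated once, when the function is defined.
    return end_date


def fresh_default(end_date=None):
    # A None sentinel defers the "today" lookup to call time.
    if end_date is None:
        end_date = datetime.datetime.now()
    return end_date


first = frozen_default()
second = frozen_default()
# Both calls return the very same definition-time object.
same_object = first is second
```

In a long-running session, passing end_date explicitly avoids relying on whichever behavior the installed version has.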

go(start_date=None, split_date=Timestamp('2019-09-30 00:00:00'), end_date=None)[source]

Main function to train the model and predict future data. First generates training and testing data, then trains the GP on the training data. Finally, it predicts on the time range specified.

Parameters
  • start_date – Date to start training GP. If left as None, defaults to starting date of training data set.

  • split_date – Date to end training.

  • end_date – Date to end predicting using GP. If left as None, defaults to ending date of training data set. Assumes prediction starts right after training.

Returns

A tuple of four arrays, containing the mean predictions, upper/lower bounds of predictions, training data, and test data for comparison. Predictions for non-business days are dropped.
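
The source does not say how non-business days are identified. As a rough sketch, assuming a business day simply means Monday through Friday (holidays ignored), the filtering step could look like this; drop_non_business_days is a hypothetical helper, not a function of this package.

```python
import datetime


def drop_non_business_days(dates, values):
    """Keep only (date, value) pairs that fall on Monday through Friday."""
    # weekday() returns 0 for Monday ... 6 for Sunday.
    kept = [(d, v) for d, v in zip(dates, values) if d.weekday() < 5]
    return [d for d, _ in kept], [v for _, v in kept]


# 2019-10-04 is a Friday, so the next two days form a weekend.
start = datetime.date(2019, 10, 4)
dates = [start + datetime.timedelta(days=i) for i in range(4)]
values = [10.0, 10.5, 11.0, 11.5]
biz_dates, biz_values = drop_non_business_days(dates, values)
```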

src.CombinedModel module

Code for the combined model approach.

@author: Shashank Swaminathan

class src.CombinedModel.CombinedModel(ticker, comp_tickers)[source]

Bases: object

Class for handling combined model operations.

__init__(ticker, comp_tickers)[source]

Constructor. Sets up the StockRNN and GPM classes.

Parameters
  • ticker – Ticker of stocks to predict

  • comp_tickers – List of tickers to compare desired ticker against. Used for StockRNN only.

train(start_date, pred_start, pred_end, mw=0.5, n_epochs=10)[source]

Main training function. It runs both the LSTM and GP models and stores results in attributes.

Parameters
  • start_date – Training start date (for GP model only). Provide as datetime object.

  • pred_start – Date to start predictions from. Provide as datetime object.

  • pred_end – Date to end predictions. Provide as datetime object.

  • mw – Model weight. Used to do weighted average between GP and LSTM. 0 is for only the LSTM, and 1 is for only the GP. Defaults to 0.5 (equal split).

  • n_epochs – Number of epochs to train the LSTM. Defaults to 10.

Returns

(Mean predictions [t, y], Upper/lower bounds of 2 std [t, y])
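
The mw parameter described above can be read as a convex combination of the two models' mean predictions. The sketch below illustrates that reading with a hypothetical blend_predictions helper; how the package combines the uncertainty bounds is not stated in the source and is not modeled here.

```python
def blend_predictions(gp_mean, lstm_mean, mw=0.5):
    """Weighted average of two prediction sequences.

    mw = 0 returns the LSTM values only, mw = 1 the GP values only.
    """
    if not 0.0 <= mw <= 1.0:
        raise ValueError("mw must lie between 0 and 1")
    if len(gp_mean) != len(lstm_mean):
        raise ValueError("both prediction sequences must have the same length")
    return [mw * g + (1.0 - mw) * l for g, l in zip(gp_mean, lstm_mean)]


gp = [2.0, 4.0]
lstm = [4.0, 8.0]
equal_split = blend_predictions(gp, lstm)        # default mw = 0.5
lstm_only = blend_predictions(gp, lstm, mw=0.0)
gp_only = blend_predictions(gp, lstm, mw=1.0)
```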

src.StockRNN module

Code to train the RNN

@author: Duncan Mazza

class src.StockRNN.StockRNN(ticker: str, lstm_hidden_size: int = 100, lstm_num_layers: int = 2, to_compare: [<class 'str'>] = None, train_start_date: datetime.datetime = datetime.datetime(2017, 1, 1, 0, 0), train_end_date: datetime.datetime = datetime.datetime(2018, 1, 1, 0, 0), sequence_segment_length: int = 50, drop_prob: float = 0.3, device: str = 'cuda', auto_populate: bool = True, train_data_prop: float = 0.8, lr: float = 0.0001, train_batch_size: int = 10, test_batch_size: int = 4, num_workers: int = 2, label_length: int = 30, try_load_weights: bool = False, save_state_dict: bool = True)[source]

Bases: torch.nn.modules.module.Module

Class for training on and predicting stocks using an LSTM network.

__init__(ticker: str, lstm_hidden_size: int = 100, lstm_num_layers: int = 2, to_compare: [<class 'str'>] = None, train_start_date: datetime.datetime = datetime.datetime(2017, 1, 1, 0, 0), train_end_date: datetime.datetime = datetime.datetime(2018, 1, 1, 0, 0), sequence_segment_length: int = 50, drop_prob: float = 0.3, device: str = 'cuda', auto_populate: bool = True, train_data_prop: float = 0.8, lr: float = 0.0001, train_batch_size: int = 10, test_batch_size: int = 4, num_workers: int = 2, label_length: int = 30, try_load_weights: bool = False, save_state_dict: bool = True)[source]
Parameters
  • lstm_hidden_size – size of the lstm hidden layer

  • lstm_num_layers – number of layers for the lstm

  • ticker – ticker of company whose stock you want to predict

  • to_compare – tickers of companies whose stock will be part of the features of the dataset

  • train_start_date – date to request data from

  • train_end_date – date to request data to

  • sequence_segment_length – length of sequences to train the model on

  • drop_prob – probability for dropout layers

  • device – string for device to try sending the tensors to (e.g. “cuda”)

  • auto_populate – automatically calls all ‘populate’ functions in the constructor

  • train_data_prop – proportion of data set to allocate to training data

  • lr – learning rate for the optimizer

  • train_batch_size – batch size for the training data

  • test_batch_size – batch size for the testing data

  • num_workers – parameter for PyTorch DataLoaders

  • label_length – length of data (starting at the end of each sequence segment) to consider for the loss

  • try_load_weights – boolean for whether the model should search for a cached model state dictionary

  • save_state_dict – boolean for whether the model should cache its weights as a state dictionary

check_sliding_window_valid_at_index(end_pred_index, pred_beyond_range)[source]

Checks that the index parameter for creating a distribution of predictions is valid for the dataset. If it is not, the index is modified and a warning describing the condition is printed.

Parameters
  • end_pred_index – index of the date that is desired to be predicted

  • pred_beyond_range – tuple containing the range of the number of forecasted days the model will use to arrive at a prediction at end_pred_index

Returns

end_pred_index (modified if necessary)

do_training(num_epochs: int, verbose=True, plot_output: bool = True, plot_output_figsize: (<class 'int'>, <class 'int'>) = (5, 10), plot_loss: bool = True, plot_loss_figsize: (<class 'int'>, <class 'int'>) = (7, 5))[source]

This method trains the network using data in train_loader and checks against the data in test_loader at the end of each epoch. The forward pass through the network produces sequences of the same length as the input sequences. The sequences in the label data are of length label_length, so the output sequences are cropped to length label_length before being passed through the MSE loss function. Because each element of the output sequence at position n is a prediction of the input element n+1, the cropped windows of the output sequences are given by the window that terminates at the second-to-last element of the output sequence.

Parameters
  • num_epochs – number of epochs to run the training for

  • verbose – if true, print diagnostic progress updates and final training and test loss

  • plot_output – if true, plot the results of the final pass through the LSTM with a randomly selected segment of data

  • plot_output_figsize – figsize argument for the output plot

  • plot_loss – if true, plot the training and test loss

  • plot_loss_figsize – figsize argument for the loss plot
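
The cropping rule in the do_training description can be illustrated with plain lists. This is only a sketch of the indexing, assuming the labels are the last label_length elements of the input segment; crop_for_loss and mse are hypothetical helpers, and the real method works on batched tensors.

```python
def crop_for_loss(output_seq, input_seq, label_length):
    """Return (predictions, labels) windows of length label_length."""
    # Labels: the last label_length elements of the input segment.
    labels = input_seq[-label_length:]
    # output_seq[n] predicts input_seq[n + 1], so the matching window
    # ends at the second-to-last output element.
    predictions = output_seq[-(label_length + 1):-1]
    return predictions, labels


def mse(predictions, labels):
    """Mean squared error between two equal-length sequences."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


inputs = [float(i) for i in range(10)]
# A "perfect" toy model: output element n equals input element n + 1.
outputs = [x + 1.0 for x in inputs]
predictions, labels = crop_for_loss(outputs, inputs, label_length=3)
loss = mse(predictions, labels)
```

With a perfect one-step-ahead model the two windows coincide, so the loss is zero; shifting the output window by one in either direction would misalign them.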

forward(X: torch.Tensor, predict_beyond: int = 0)[source]

Completes a forward pass of data through the network. The tensor passed in is of shape (batch size, features, sequence length), and the output is of shape (batch size, 1, sequence length). The data is passed through an LSTM layer with an arbitrary number of layers and an arbitrary hidden size (as defined by lstm_hidden_size and lstm_num_layers); the output is then passed through 2 fully connected layers such that the final number of features is the same as the input number of features (num_companies).

Parameters
  • X – input matrix of data of shape: (batch size, features (number of companies), sequence length)

  • predict_beyond – number of days to recursively predict beyond the given input sequence

Returns

output of the forward pass of the data through the network (same shape as input)
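
The predict_beyond parameter refers to recursive prediction, where each forecast is appended to the history and fed back in to produce the next one. The source does not describe how the network does this internally, so the following is only a generic sketch with a hypothetical step function standing in for the model.

```python
def predict_recursively(step_fn, sequence, predict_beyond=0):
    """Extend a sequence by feeding each prediction back as input.

    step_fn is any callable that maps the history so far to the next value.
    """
    history = list(sequence)
    for _ in range(predict_beyond):
        history.append(step_fn(history))
    # Return only the newly predicted values.
    return history[len(sequence):]


def linear_step(history):
    # Toy stand-in for a model: continue the most recent trend.
    return history[-1] + (history[-1] - history[-2])


forecast = predict_recursively(linear_step, [1.0, 2.0, 3.0], predict_beyond=3)
nothing = predict_recursively(linear_step, [1.0, 2.0, 3.0])
```

Because every step consumes earlier predictions, errors can compound as predict_beyond grows.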

generate_predicted_distribution(end_pred_index: int = None, pred_beyond_range: (<class 'int'>, <class 'int'>) = (1, 10))[source]

Returns a list of predicted stock values at a given date using a range of forecast lengths

Parameters
  • end_pred_index – index of the date that is desired to be predicted

  • pred_beyond_range – tuple containing the range of the number of forecasted days the model will use to arrive at a prediction at end_pred_index

Returns

  • list of predicted values (of length given by pred_beyond_range)

  • actual stock value corresponding to the predictions
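
One way to picture the idea of predicting a single date with a range of forecast lengths is sketched below. It assumes the range is inclusive at both ends and uses a naive persistence forecaster in place of the LSTM; predicted_distribution and persistence are hypothetical names, and the package's actual windowing may differ.

```python
def predicted_distribution(series, end_pred_index, pred_beyond_range, forecast_fn):
    """Predict series[end_pred_index] once per forecast horizon."""
    lo, hi = pred_beyond_range
    predictions = []
    for horizon in range(lo, hi + 1):  # assumption: both ends inclusive
        # Use only the data available `horizon` days before the target date.
        history = series[: end_pred_index - horizon + 1]
        predictions.append(forecast_fn(history, horizon))
    return predictions, series[end_pred_index]


def persistence(history, horizon):
    # Toy forecaster: tomorrow looks like the last observed value.
    return history[-1]


series = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
preds, actual = predicted_distribution(series, 5, (1, 3), persistence)
```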

make_prediction_with_validation(predict_beyond: int = 30, num_plots: int = 2, data_start_indices: numpy.ndarray = None)[source]

Selects data from the dataset and makes a prediction predict_beyond days out, and the actual values of the stock are shown alongside.

Parameters
  • predict_beyond – days to predict ahead in the future

  • data_start_indices – indices corresponding to locations in the total dataset sequence for the training data to be gathered from (with the training data being of length sequence_segment_length)

Returns

  • length of the data being returned (training + prediction sequences)

  • datetime objects corresponding to data_start_indices

  • datetime objects corresponding to the end of the returned sequences

  • indices corresponding to the days where the predicted sequence starts

  • input and label sequence data associated with each pass of the model

  • numpy array of the model output

  • training data (in absolute stock value form instead of the % change that the model sees)

  • output prediction of the model converted from % change to actual stock values

  • label data (in absolute stock value form instead of % change) to compare to the output prediction

  • disparity between predicted stock values and actual stock values

peek_dataset(figsize: (<class 'int'>, <class 'int'>) = (10, 5))[source]

Creates a simple line plot of the entire dataset

Parameters

figsize – tuple of integers for plt.subplots figsize argument

plot_predicted_distribution(latest_data_index: int = None, pred_beyond_range: (<class 'int'>, <class 'int'>) = (1, 10))[source]

TODO: documentation

plot_prediction_with_validation(predict_beyond: int = 30, num_plots: int = 5, plt_scl=20)[source]

A method for debugging/validating make_prediction_with_validation - makes predictions and shows the raw output of the model, reconstructed stock prices, and disparity between predicted stock prices and actual stock prices.

Parameters

plt_scl – integer for the width and height parameters of the matplotlib plot

populate_daily_stock_data()[source]

Populates self.daily_stock_data with the day-over-day percent change of the closing stock prices. The data for each company is truncated such that each company’s array of data is the same length as the rest and such that their length is divisible by sequence_segment_length.
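
The truncation described here can be sketched as follows. The source does not say which end of each series is discarded, so this hypothetical align_and_truncate helper assumes the leading values are kept.

```python
def align_and_truncate(series_by_ticker, segment_length):
    """Cut every series to one common length divisible by segment_length."""
    common = min(len(s) for s in series_by_ticker.values())
    usable = common - (common % segment_length)
    # Assumption: keep the first `usable` values of each series.
    return {t: s[:usable] for t, s in series_by_ticker.items()}


raw = {"AAA": list(range(11)), "BBB": list(range(13))}
aligned = align_and_truncate(raw, segment_length=5)
```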

populate_loaders()[source]

Populates the train_loader, test_loader, train_loader_len, and test_loader_len attributes.

populate_test_train(rand_seed: int = -1)[source]

Populates the self.train_data and self.test_data tensors with complementary subsets of the sequences of self.daily_stock_data, where the sequences are the self.sequence_length length sequences of data that, when concatenated, comprise self.daily_stock_data.

Parameters

rand_seed – value to seed the random number generator; if -1 (or any value < 0), then do not seed the random number generator.
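
As an illustration of the complementary split and the seeding rule, here is a small sketch using plain lists. The split_segments helper and its train_prop argument are assumptions made for the example; the actual method operates on tensors and may select segments differently.

```python
import random


def split_segments(data, segment_length, train_prop=0.8, rand_seed=-1):
    """Cut data into fixed-length segments and split them into two disjoint sets."""
    segments = [data[i:i + segment_length]
                for i in range(0, len(data), segment_length)]
    # A negative seed means "do not seed": fall back to system randomness.
    rng = random.Random(rand_seed) if rand_seed >= 0 else random.Random()
    order = list(range(len(segments)))
    rng.shuffle(order)
    n_train = int(len(segments) * train_prop)
    train = [segments[i] for i in order[:n_train]]
    test = [segments[i] for i in order[n_train:]]
    return train, test


data = list(range(20))
train_a, test_a = split_segments(data, segment_length=4, rand_seed=7)
train_b, test_b = split_segments(data, segment_length=4, rand_seed=7)
```

Reusing the same non-negative seed reproduces the split, which helps when comparing training runs.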

pred_in_conj(start_of_pred_idx: int, n_days: int, pred_beyond_range: (<class 'int'>, <class 'int'>) = (1, 10))[source]

Calls generate_predicted_distribution to create a list of predictions for each day in a given range, and returns the mean and standard deviation associated with each day.

Parameters
  • start_of_pred_idx – integer corresponding to the first date whose distribution will be predicted

  • n_days – number of days from start_of_pred_idx to predict out

  • pred_beyond_range – tuple containing the range of the number of forecasted days the model will use to arrive at a prediction at end_pred_index

Returns

  • list of length n_days of the mean values associated with each day’s predicted stock

  • list of length n_days of the standard deviation associated with each day’s predicted stock
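
Reducing each day's list of predictions to a mean and a standard deviation could look like the sketch below. Whether the package uses the population or the sample standard deviation is not stated; this hypothetical summarize_distributions helper assumes the population form.

```python
import statistics


def summarize_distributions(predictions_per_day):
    """Return per-day means and standard deviations."""
    means = [statistics.mean(day) for day in predictions_per_day]
    # Assumption: population standard deviation (divides by N, not N - 1).
    stds = [statistics.pstdev(day) for day in predictions_per_day]
    return means, stds


per_day = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
means, stds = summarize_distributions(per_day)
```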

return_loaders() → [<class 'torch.utils.data.dataloader.DataLoader'>, <class 'torch.utils.data.dataloader.DataLoader'>][source]

Returns the torch.utils.data.DataLoader objects for the training and test sets.

Returns

  • training DataLoader

  • testing DataLoader

src.get_data module

Utility functions for acquiring financial data

@author: Duncan Mazza

class src.get_data.Company(ticker: str, start_date: datetime.datetime, end_date: datetime.datetime, call_populate_dataframe: bool = True, cache_bool: bool = True)[source]

Bases: object

__init__(ticker: str, start_date: datetime.datetime, end_date: datetime.datetime, call_populate_dataframe: bool = True, cache_bool: bool = True)[source]

TODO: documentation here

Parameters
  • ticker

  • start_date

  • end_date

  • call_populate_dataframe

  • cache_bool

cache(file_path: str, data_frame: pandas.core.frame.DataFrame = None)[source]

Saves a DataFrame as a .csv to a path relative to the current working directory.

Parameters

  • file_path – path to save the DataFrame to; if not an absolute path, then it is used as a path relative to the current working directory

  • data_frame – DataFrame to save (if not specified, the data_frame attribute is used)

get_date_at_index(i)[source]

Returns the datetime object at index i.

Parameters

i – index to return the date of

static moving_average(a, n, padding: bool = True)[source]

Calculates the moving average of a one-dimensional numpy array a. It can pad the beginning of the pre-filtered data with n - 1 values so that the output length is not truncated; the padding is populated with the same value as the first value of the % change sequence.

Parameters
  • a – one-dimensional numpy array

  • n – length of rolling average filter

  • padding – boolean for whether to utilize padding

Returns

filtered array
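
The padding behavior can be seen in a few lines of plain Python. This is an illustrative re-implementation on lists under the description above, not the package's numpy code.

```python
def moving_average(a, n, padding=True):
    """Rolling mean of window length n over the sequence a."""
    values = list(a)
    if padding and values:
        # Prepend n - 1 copies of the first value so the output keeps len(a).
        values = [values[0]] * (n - 1) + values
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


padded = moving_average([1.0, 2.0, 3.0, 4.0], n=2)
unpadded = moving_average([1.0, 2.0, 3.0, 4.0], n=2, padding=False)
```

With padding the output has as many elements as the input; without it the output is shorter by n - 1.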

populate_dataframe()[source]

Populates data_frame with stock data acquired using pandas_datareader.data. If start_date and/or end_date differ from the actual start and end dates in data_frame, they are updated to equal the actual dates, and start_date_changed and end_date_changed record whether start_date and end_date were changed, respectively.

reconstruct_stock_from_percent_change(percent_change_vec: numpy.ndarray, initial_condition_index: int)[source]

Reconstruct the stock prices from percent change

Parameters
  • percent_change_vec – vector of percent changes

  • initial_condition_index – index of initial condition for the % change
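
Reconstruction from percent change is a running product. The sketch below assumes the changes are stored as fractions (0.5 meaning +50%) and takes the starting price directly instead of an index into the data frame; both choices are assumptions made for illustration.

```python
def reconstruct_prices(initial_price, percent_changes):
    """Rebuild a price series from day-over-day fractional changes."""
    prices = []
    current = initial_price
    for change in percent_changes:
        # Each price is the previous price scaled by (1 + change).
        current = current * (1.0 + change)
        prices.append(current)
    return prices


prices = reconstruct_prices(100.0, [0.5, -0.5])
```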

return_data(ticker: str = None, start_date: datetime.datetime = None, end_date: datetime.datetime = None) → pandas.core.frame.DataFrame[source]

Returns the DataFrame containing the financial data for the prescribed company. This function will pull the data from the Yahoo API built into pandas_datareader if it has not been cached and will then cache the data, or it will read the data from the cached csv file. The cached files are named with the ticker, start date, and end dates that specify the API query, and exist in the .cache/ folder located under the current working directory.

Parameters
  • ticker – ticker string for the company whose data will be retrieved

  • start_date – start date for the data record

  • end_date – end date for the data record

Returns

DataFrame of financial data
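
The read-through caching described for return_data follows a common pattern: build a file name from the query, return the cached file if it exists, otherwise fetch and store. The sketch below shows that pattern with the standard library only. The file-name format, cache_path, and load_or_fetch are hypothetical, and fetch_fn stands in for the real API call.

```python
import datetime
from pathlib import Path


def cache_path(ticker, start_date, end_date, cache_dir=".cache"):
    # Hypothetical naming scheme; the package's actual format may differ.
    name = f"{ticker}_{start_date:%Y-%m-%d}_{end_date:%Y-%m-%d}.csv"
    return Path(cache_dir) / name


def load_or_fetch(path, fetch_fn):
    """Return cached CSV text if present; otherwise fetch, cache, and return it."""
    path = Path(path)
    if path.exists():
        return path.read_text()
    text = fetch_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return text


example = cache_path("AAPL",
                     datetime.datetime(2017, 1, 1),
                     datetime.datetime(2018, 1, 1))
```

Keying the file name on ticker and both dates means a query with a different date range is fetched and cached separately.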

return_dummy_data()[source]

Creates linear stock data as dummy data for testing a model

Returns

numpy array of dummy data

return_numpy_array_of_company_daily_stock_close() → numpy.ndarray[source]

Returns a numpy array of the “Close” column of data_frame.

Returns

numpy array of closing stock prices indexed by day

return_numpy_array_of_company_daily_stock_percent_change(rolling_avg_length: int = 4) → numpy.ndarray[source]

Converts the numpy array of the closing stock data (acquired by calling return_numpy_array_of_company_daily_stock_close) into an array of day-over-day percent change. Adds a value of 0 at the beginning of the array to maintain sequence length.

Parameters

rolling_avg_length – if nonzero, applies a rolling average filter to the percent change data (see moving_average for details on how the rolling average is calculated with padding)

Returns

numpy array of length 1 less than the array generated by return_numpy_array_of_company_daily_stock_close
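
The day-over-day conversion can be sketched as below. The description above gives two different accounts of the output length; this example follows the "leading 0" variant, expresses changes as fractions, and omits the rolling average. percent_change is a hypothetical helper.

```python
def percent_change(closes):
    """Day-over-day fractional change with a leading 0 to keep the length."""
    changes = [0.0]
    for previous, current in zip(closes, closes[1:]):
        changes.append((current - previous) / previous)
    return changes


changes = percent_change([100.0, 150.0, 75.0])
```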

revise_end_date(new_end_date: datetime.datetime)[source]

Modifies data_frame such that the last date of the data is equal to new_end_date (all following data is deleted).

Parameters

new_end_date – a datetime object of the new end date for data_frame (where new_end_date exists and is unique in self.data_frame["Date"])

revise_start_date(new_start_date: datetime.datetime)[source]

Modifies data_frame such that the starting date of the data is equal to new_start_date (all prior data is deleted).

Parameters

new_start_date – a datetime object of the new start date for data_frame (where new_start_date exists and is unique in self.data_frame["Date"])

src.prob_train module

Module contents