src package¶
Submodules¶
src.BayesReg module¶
Code for the Gaussian Process Model implementation.
@author: Shashank Swaminathan
-
class
src.BayesReg.
GPM
(ticker, start_date=datetime.datetime(2000, 1, 1, 0, 0), end_date=datetime.datetime(2019, 12, 13, 12, 36, 5, 34810))[source]¶ Bases:
object
Class encapsulating Gaussian Process model implementation. Depends on the PyMC3 library.
-
__init__
(ticker, start_date=datetime.datetime(2000, 1, 1, 0, 0), end_date=datetime.datetime(2019, 12, 13, 12, 36, 5, 34810))[source]¶ init function. Fetches and prepares data from desired ticker using Company class from accompanying get_data API. Assumes start and end dates of January 1st, 2000 -> today.
- Parameters
ticker – Ticker of stock to predict.
start_date – Datetime object of when to start retrieving stock data. Defaults to 1/1/2000.
end_date – Datetime object of when to stop retrieving stock data. Defaults to today’s date.
-
go
(start_date=None, split_date=Timestamp('2019-09-30 00:00:00'), end_date=None)[source]¶ Main function to train the model and predict future data. First generates training and testing data, then trains the GP on the training data. Finally, it predicts on the time range specified.
- Parameters
start_date – Date to start training GP. If left as None, defaults to starting date of training data set.
split_date – Date to end training.
end_date – Date to end predicting using GP. If left as None, defaults to ending date of training data set. Assumes prediction starts right after training.
- Returns
A tuple of four arrays, containing the mean predictions, upper/lower bounds of predictions, training data, and test data for comparison. Predictions done on non business days are dropped.
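The final step described above, dropping predictions that fall on non-business days, can be sketched as follows. This is a minimal illustration, not the GPM implementation: the helper name drop_non_business_days is hypothetical, and a business day is assumed here to mean Monday through Friday, with market holidays ignored.

```python
from datetime import date, timedelta


def drop_non_business_days(dates, values):
    """Keep only (date, value) pairs that fall on Monday through Friday.

    Assumption: holidays are not considered; only weekends are dropped.
    """
    kept = [(d, v) for d, v in zip(dates, values) if d.weekday() < 5]
    kept_dates = [d for d, _ in kept]
    kept_values = [v for _, v in kept]
    return kept_dates, kept_values


# Seven consecutive days starting on 2019-09-30 (the default split_date).
days = [date(2019, 9, 30) + timedelta(days=i) for i in range(7)]
preds = [100.0 + i for i in range(7)]
biz_days, biz_preds = drop_non_business_days(days, preds)
```

Filtering dates and values together keeps the two sequences aligned, which matters when the predictions are later compared against test data.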
-
src.CombinedModel module¶
Code for the combined model approach.
@author: Shashank Swaminathan
-
class
src.CombinedModel.
CombinedModel
(ticker, comp_tickers)[source]¶ Bases:
object
Class for handling combined model operations.
-
__init__
(ticker, comp_tickers)[source]¶ init function. It will set up the StockRNN and GPM classes.
- Parameters
ticker – Ticker of stocks to predict
comp_tickers – List of tickers to compare desired ticker against. Used for StockRNN only.
-
train
(start_date, pred_start, pred_end, mw=0.5, n_epochs=10)[source]¶ Main training function. It runs both the LSTM and GP models and stores results in attributes.
- Parameters
start_date – Training start date (for GP model only). Provide as datetime object.
pred_start – Date to start predictions from. Provide as datetime object.
pred_end – Date to end predictions. Provide as datetime object.
mw – Model weight. Used to do weighted average between GP and LSTM. 0 is for only the LSTM, and 1 is for only the GP. Defaults to 0.5 (equal split).
n_epochs – Number of epochs to train the LSTM. Defaults to 10.
- Returns
(Mean predictions [t, y], Upper/lower bounds of 2 std [t, y])
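The role of mw can be illustrated with a plain weighted average of the two models' mean predictions. This is a sketch under the stated convention (0 selects only the LSTM, 1 selects only the GP); the function name blend_predictions is hypothetical, and how the upper/lower bounds are combined is not specified in this documentation, so only the means are shown.

```python
def blend_predictions(gp_mean, lstm_mean, mw=0.5):
    """Weighted average of two prediction sequences.

    mw = 1 returns the GP predictions, mw = 0 returns the LSTM predictions.
    """
    if not 0.0 <= mw <= 1.0:
        raise ValueError("mw must lie in [0, 1]")
    if len(gp_mean) != len(lstm_mean):
        raise ValueError("prediction sequences must have equal length")
    return [mw * g + (1.0 - mw) * l for g, l in zip(gp_mean, lstm_mean)]


gp = [10.0, 12.0, 14.0]
lstm = [20.0, 22.0, 24.0]
combined = blend_predictions(gp, lstm, mw=0.5)
```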
-
src.StockRNN module¶
Code to train the RNN
@author: Duncan Mazza
-
class
src.StockRNN.
StockRNN
(ticker: str, lstm_hidden_size: int = 100, lstm_num_layers: int = 2, to_compare: [<class 'str'>] = None, train_start_date: datetime.datetime = datetime.datetime(2017, 1, 1, 0, 0), train_end_date: datetime.datetime = datetime.datetime(2018, 1, 1, 0, 0), sequence_segment_length: int = 50, drop_prob: float = 0.3, device: str = 'cuda', auto_populate: bool = True, train_data_prop: float = 0.8, lr: float = 0.0001, train_batch_size: int = 10, test_batch_size: int = 4, num_workers: int = 2, label_length: int = 30, try_load_weights: bool = False, save_state_dict: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
Class for training on and predicting stocks using a LSTM network
-
__init__
(ticker: str, lstm_hidden_size: int = 100, lstm_num_layers: int = 2, to_compare: [<class 'str'>] = None, train_start_date: datetime.datetime = datetime.datetime(2017, 1, 1, 0, 0), train_end_date: datetime.datetime = datetime.datetime(2018, 1, 1, 0, 0), sequence_segment_length: int = 50, drop_prob: float = 0.3, device: str = 'cuda', auto_populate: bool = True, train_data_prop: float = 0.8, lr: float = 0.0001, train_batch_size: int = 10, test_batch_size: int = 4, num_workers: int = 2, label_length: int = 30, try_load_weights: bool = False, save_state_dict: bool = True)[source]¶ - Parameters
lstm_hidden_size – size of the lstm hidden layer
lstm_num_layers – number of layers for the lstm
ticker – ticker of company whose stock you want to predict
to_compare – ticker of companies whose stock will be part of the features of the dataset
train_start_date – date to request data from
train_end_date – date to request data to
sequence_segment_length – length of sequences to train the model on
drop_prob – probability for dropout layers
device – string for device to try sending the tensors to (i.e. “cuda”)
auto_populate – automatically calls all ‘populate’ functions in the constructor
train_data_prop – proportion of data set to allocate to training data
lr – learning rate for the optimizer
train_batch_size – batch size for the training data
test_batch_size – batch size for the testing data
num_workers – parameter for PyTorch DataLoaders
label_length – length of data (starting at the end of each sequence segment) to consider for the loss
try_load_weights – boolean for whether the model should search for a cached model state dictionary
save_state_dict – boolean for whether the model should cache its weights as a state dictionary
-
check_sliding_window_valid_at_index
(end_pred_index, pred_beyond_range)[source]¶ Checks that the index parameter for creating a distribution of predictions is valid for the dataset; if it is not, modifies it and prints a warning describing the condition.
- Parameters
end_pred_index – index of the date that is desired to be predicted
pred_beyond_range – tuple containing the range of the number of forecasted days the model will use to arrive at a prediction at end_pred_index
- Returns
end_pred_index (modified if necessary)
-
do_training
(num_epochs: int, verbose=True, plot_output: bool = True, plot_output_figsize: (<class 'int'>, <class 'int'>) = (5, 10), plot_loss: bool = True, plot_loss_figsize: (<class 'int'>, <class 'int'>) = (7, 5))[source]¶ This method trains the network using data in
train_loader and checks against the data in test_loader at the end of each epoch. The forward pass through the network produces sequences of the same length as the input sequences. The sequences in the label data are of length label_length, so the output sequences are cropped to length label_length before being passed through the MSE loss function. Because each element of the output sequence at position n is a prediction of the input element n+1, the cropped windows of the output sequences are given by the window that terminates at the second-to-last element of the output sequence.
- Parameters
num_epochs – number of epochs to run the training for
verbose – if true, print diagnostic progress updates and final training and test loss
plot_output – if true, plot the results of the final pass through the LSTM with a randomly selected segment of data
plot_output_figsize – figsize argument for the output plot
plot_loss – if true, plot the training and test loss
plot_loss_figsize – figsize argument for the loss plot
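The cropping described for do_training can be sketched with plain Python lists. This is an illustration under two assumptions that are not confirmed by the source: the labels are taken to be the last label_length elements of the input sequence, and the helper name cropped_mse is hypothetical. The actual method operates on batched tensors.

```python
def cropped_mse(inputs, outputs, label_length):
    """MSE between the label window and the aligned output window.

    Assumptions: outputs[n] predicts inputs[n + 1], and the labels are the
    last `label_length` elements of the input sequence.
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    if not 0 < label_length < len(outputs):
        raise ValueError("label_length must be in [1, len(outputs) - 1]")
    labels = inputs[-label_length:]
    # Window of length label_length that ends at the second-to-last output,
    # so that window[k] is the prediction of labels[k].
    window = outputs[-(label_length + 1):-1]
    return sum((o - y) ** 2 for o, y in zip(window, labels)) / label_length


# A perfect one-step-ahead predictor: outputs[n] == inputs[n + 1].
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = [2.0, 3.0, 4.0, 5.0, 6.0]
loss = cropped_mse(inputs, outputs, label_length=3)
```

The last output element is excluded because it predicts a value one step past the end of the input, for which no label exists in the segment.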
-
forward
(X: torch.Tensor, predict_beyond: int = 0)[source]¶ Completes a forward pass of data through the network. The tensor passed in is of shape (batch size, features, sequence length), and the output is of shape (batch size, 1, sequence length). The data is passed through a LSTM layer with an arbitrary number of layers and an arbitrary hidden size (as defined by
lstm_hidden_size and lstm_num_layers); the output is then passed through 2 fully connected layers such that the final number of features is the same as the input number of features (num_companies).
- Parameters
X – input matrix of data of shape: (batch size, features (number of companies), sequence length)
predict_beyond – number of days to recursively predict beyond the given input sequence
- Returns
output of the forward pass of the data through the network (same shape as input)
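The recursive prediction controlled by predict_beyond can be pictured as repeatedly feeding the model its own latest output. The sketch below is not the StockRNN code: the one_step callable is a hypothetical stand-in for a single forward pass, and plain lists replace tensors.

```python
def predict_recursively(one_step, sequence, predict_beyond=0):
    """Extend `sequence` by `predict_beyond` values, one step at a time.

    `one_step` is a hypothetical callable that maps the sequence so far to
    the next predicted value; each prediction is appended and reused as input.
    """
    extended = list(sequence)
    for _ in range(predict_beyond):
        extended.append(one_step(extended))
    return extended


def repeat_last_change(seq):
    """Toy one-step model: assume the last observed change repeats."""
    return seq[-1] + (seq[-1] - seq[-2])


forecast = predict_recursively(repeat_last_change, [1.0, 2.0, 3.0], predict_beyond=3)
```

Because each step consumes earlier predictions, errors compound as predict_beyond grows.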
-
generate_predicted_distribution
(end_pred_index: int = None, pred_beyond_range: (<class 'int'>, <class 'int'>) = (1, 10))[source]¶ Returns a list of predicted stock values at a given date using a range of forecast lengths
- Parameters
end_pred_index – index of the date that is desired to be predicted
pred_beyond_range – tuple containing the range of the number of forecasted days the model will use to arrive at a prediction at end_pred_index
- Returns
list of predicted values (of length given by pred_beyond_range)
actual stock value corresponding to the predictions
-
make_prediction_with_validation
(predict_beyond: int = 30, num_plots: int = 2, data_start_indices: numpy.ndarray = None)[source]¶ Selects data from the dataset and makes a prediction
predict_beyond days out, and the actual values of the stock are shown alongside.
- Parameters
predict_beyond – days to predict ahead in the future
data_start_indices – indices corresponding to locations in the total dataset sequence for the training data to be gathered from (with the training data being of length sequence_segment_length)
- Returns
length of the data being returned (training + prediction sequences)
datetime objects corresponding to data_start_indices
datetime objects corresponding to the end of the returned sequences
indices corresponding to the days where the predicted sequence starts
input and label sequence data associated with each pass of the model
numpy array of the model output
training data (in absolute stock value form instead of the % change that the model sees)
output prediction of the model converted from % change to actual stock values
label data (in absolute stock value form instead of % change) to compare to the output prediction
disparity between predicted stock values and actual stock values
-
peek_dataset
(figsize: (<class 'int'>, <class 'int'>) = (10, 5))[source]¶ Creates a simple line plot of the entire dataset
- Parameters
figsize – tuple of integers for the plt.subplots figsize argument
-
plot_predicted_distribution
(latest_data_index: int = None, pred_beyond_range: (<class 'int'>, <class 'int'>) = (1, 10))[source]¶ TODO: documentation
-
plot_prediction_with_validation
(predict_beyond: int = 30, num_plots: int = 5, plt_scl=20)[source]¶ A method for debugging/validating
make_prediction_with_validation: it makes predictions and shows the raw output of the model, reconstructed stock prices, and disparity between predicted stock prices and actual stock prices.
- Parameters
predict_beyond – days to predict ahead in the future
num_plots – number of times to call make_prediction_with_validation and plot the results
plt_scl – integer for width and height parameters of matplotlib plot
-
populate_daily_stock_data
()[source]¶ Populates
self.daily_stock_data with the day-over-day percent change of the closing stock prices. The data for each company is truncated such that each company’s array of data is the same length as the rest and such that their length is divisible by sequence_segment_length.
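The truncation rule above can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it assumes the excess is trimmed from the end of each series, which the description does not state.

```python
def truncate_to_common_length(series_by_ticker, segment_length):
    """Trim every series to one shared length divisible by segment_length.

    Assumption: values are dropped from the end of each series.
    """
    if segment_length < 1:
        raise ValueError("segment_length must be positive")
    common = min(len(s) for s in series_by_ticker.values())
    usable = common - common % segment_length
    return {ticker: s[:usable] for ticker, s in series_by_ticker.items()}


raw = {"AAA": list(range(107)), "BBB": list(range(103))}
trimmed = truncate_to_common_length(raw, segment_length=50)
```

With series of length 107 and 103 and a segment length of 50, both are cut to 100 values, giving two full segments per company.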
-
populate_loaders
()[source]¶ Populates
the train_loader, test_loader, train_loader_len, and test_loader_len attributes.
-
populate_test_train
(rand_seed: int = -1)[source]¶ Populates
self.train_data and self.test_data tensors with complementary subsets of the sequences of self.daily_stock_data, where the sequences are the segments of length self.sequence_length that, when concatenated, comprise self.daily_stock_data.
- Parameters
rand_seed – value to seed the random number generator; if -1 (or any value < 0), then do not seed the random number generator.
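A complementary split with optional seeding might look like the sketch below. It is not the StockRNN implementation: the function name is hypothetical, lists stand in for tensors, and the use of train_prop to size the training subset is an assumption based on the train_data_prop constructor parameter.

```python
import random


def split_segments(data, segment_length, train_prop=0.8, rand_seed=-1):
    """Cut `data` into segments and split them into train and test subsets.

    A negative rand_seed leaves the random number generator unseeded.
    Assumes len(data) is divisible by segment_length.
    """
    segments = [data[i:i + segment_length]
                for i in range(0, len(data), segment_length)]
    rng = random.Random(rand_seed) if rand_seed >= 0 else random.Random()
    indices = list(range(len(segments)))
    rng.shuffle(indices)
    n_train = int(len(segments) * train_prop)
    # Sorting restores chronological order within each subset.
    train = [segments[i] for i in sorted(indices[:n_train])]
    test = [segments[i] for i in sorted(indices[n_train:])]
    return train, test


data = list(range(100))
train, test = split_segments(data, segment_length=10, train_prop=0.8, rand_seed=7)
```

Every segment lands in exactly one subset, so concatenating the two subsets recovers the full dataset, and a fixed non-negative seed makes the split reproducible.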
-
pred_in_conj
(start_of_pred_idx: int, n_days: int, pred_beyond_range: (<class 'int'>, <class 'int'>) = (1, 10))[source]¶ Calls generate_predicted_distribution to create a list of predictions for each day in a given range, and returns the mean and standard deviation associated with each day.
- Parameters
start_of_pred_idx – integer corresponding to the first date whose distribution will be predicted
n_days – number of days from start_of_pred_idx to predict out
pred_beyond_range – tuple containing the range of the number of forecasted days the model will use to arrive at a prediction at end_pred_index
- Returns
list of length n_days of the mean values associated with each day’s predicted stock
list of length n_days of the standard deviation associated with each day’s predicted stock
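The per-day summary returned by pred_in_conj can be sketched with the standard library. The helper name is hypothetical, and the choice of population standard deviation is an assumption; the source does not say which estimator is used.

```python
from statistics import mean, pstdev


def summarize_daily_predictions(per_day_predictions):
    """Return (means, stds) with one entry per day.

    `per_day_predictions` holds one list of predicted values per day, for
    example the predictions made with different forecast lengths.
    Assumption: population standard deviation (divide by N).
    """
    means = [mean(p) for p in per_day_predictions]
    stds = [pstdev(p) for p in per_day_predictions]
    return means, stds


predictions = [[1.0, 3.0], [2.0, 2.0, 2.0]]
means, stds = summarize_daily_predictions(predictions)
```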
-
src.get_data module¶
Utility functions for acquiring financial data
@author: Duncan Mazza
-
class
src.get_data.
Company
(ticker: str, start_date: datetime.datetime, end_date: datetime.datetime, call_populate_dataframe: bool = True, cache_bool: bool = True)[source]¶ Bases:
object
-
__init__
(ticker: str, start_date: datetime.datetime, end_date: datetime.datetime, call_populate_dataframe: bool = True, cache_bool: bool = True)[source]¶ TODO: documentation here
- Parameters
ticker –
start_date –
end_date –
call_populate_dataframe –
cache_bool –
-
cache
(file_path: str, data_frame: pandas.core.frame.DataFrame = None)[source]¶ Saves a DataFrame as a
.csv to a path relative to the current working directory.
- Parameters
file_path – path to save the DataFrame to; if not an absolute path, then it is used as a path relative to the current working directory
data_frame – DataFrame to save (if not specified, will use the data_frame attribute)
-
get_date_at_index
(i)[source]¶ Returns the datetime object at index i.
- Parameters
i – index to return the date of
-
static
moving_average
(a, n, padding: bool = True)[source]¶ Calculates the moving average of a one-dimensional numpy array
a, capable of utilizing a padding of length n - 1 at the beginning of the pre-filtered data so that the length is not truncated; the padding is populated with the same value as the first value of the % change sequence.
- Parameters
a – one-dimensional numpy array
n – length of rolling average filter
padding – boolean for whether to utilize padding
- Returns
filtered array
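The padded moving average described above can be sketched in pure Python. This follows the description (n - 1 copies of the first value are prepended so the output keeps the input length) but is not the Company.moving_average source, which works on numpy arrays.

```python
def moving_average(a, n, padding=True):
    """Moving average of window length n over a one-dimensional sequence.

    With padding, n - 1 copies of the first value are prepended so the
    result has the same length as the input.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    values = list(a)
    if padding and values:
        values = [values[0]] * (n - 1) + values
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


padded = moving_average([1.0, 2.0, 3.0, 4.0], n=2)
unpadded = moving_average([1.0, 2.0, 3.0, 4.0], n=2, padding=False)
```

Without padding the result is n - 1 elements shorter than the input, which is why padding is useful when the filtered data must stay aligned with dates.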
-
populate_dataframe
()[source]¶ Populates
data_frame with stock data acquired using pandas_datareader.data. If start_date and/or end_date differ from the actual start and end dates in data_frame, sets start_date and end_date to those actual dates and sets start_date_changed and end_date_changed to reflect whether each was changed.
-
reconstruct_stock_from_percent_change
(percent_change_vec: numpy.ndarray, initial_condition_index: int)[source]¶ Reconstruct the stock prices from percent change
- Parameters
percent_change_vec – vector of percent changes
initial_condition_index – index of initial condition for the % change
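Reconstruction from percent changes amounts to a running product. The sketch below makes two assumptions that the source does not confirm: changes are expressed as fractions (0.01 means 1%), and the initial price is passed directly rather than looked up through initial_condition_index. The function name is hypothetical.

```python
def reconstruct_prices(percent_change, initial_price):
    """Rebuild a price series from fractional day-over-day changes.

    Each change is applied to the previous reconstructed price, starting
    from `initial_price`.
    """
    prices = []
    price = initial_price
    for change in percent_change:
        price = price * (1.0 + change)
        prices.append(price)
    return prices


prices = reconstruct_prices([0.0, 0.1, -0.5], initial_price=100.0)
```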
-
return_data
(ticker: str = None, start_date: datetime.datetime = None, end_date: datetime.datetime = None) → pandas.core.frame.DataFrame[source]¶ Returns the DataFrame containing the financial data for the prescribed company. This function will pull the data from the Yahoo API built into pandas_datareader if it has not been cached and will then cache the data, or it will read the data from the cached
csv file. The cached files are named with the ticker, start date, and end dates that specify the API query, and exist in the .cache/ folder located under the current working directory.
- Parameters
ticker – ticker string for the company whose data will be retrieved
start_date – start date for the data record
end_date – end date for the data record
- Returns
DataFrame of financial data
-
return_dummy_data
()[source]¶ Creates linear stock data as dummy data for testing a model
- Returns
numpy array of dummy data
-
return_numpy_array_of_company_daily_stock_close
() → numpy.ndarray[source]¶ Returns a numpy array of the “Close” column of
data_frame.
- Returns
numpy array of closing stock prices indexed by day
-
return_numpy_array_of_company_daily_stock_percent_change
(rolling_avg_length: int = 4) → numpy.ndarray[source]¶ Converts the numpy array of the closing stock data (acquired by calling return_numpy_array_of_company_daily_stock_close) into an array of day-over-day percent change. Adds a value of 0 at the beginning of the array to maintain sequence length.
- Parameters
rolling_avg_length – if nonzero, applies a rolling average filter to the percent change data (see moving_average for details on how the rolling average is calculated with padding)
- Returns
numpy array of length 1 less than the array generated by return_numpy_array_of_company_daily_stock_close
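The conversion to day-over-day change can be sketched as below. It follows the description's statement that a 0 is added at the beginning, so the output here has the same length as the input; changes are assumed to be fractions rather than percentages, the rolling average step is omitted, and the function name is hypothetical.

```python
def daily_percent_change(close):
    """Fractional day-over-day change with a leading 0.

    Assumption: change[k] = (close[k] - close[k - 1]) / close[k - 1],
    and the first entry is 0 because no previous day exists.
    """
    if not close:
        return []
    changes = [0.0]
    for prev, cur in zip(close, close[1:]):
        changes.append((cur - prev) / prev)
    return changes


changes = daily_percent_change([100.0, 110.0, 99.0])
```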
-
revise_end_date
(new_end_date: datetime.datetime)[source]¶ Modifies
data_frame such that the last date of the data is equal to new_end_date (all following data is deleted).
- Parameters
new_end_date – a datetime object of the new end date for data_frame (where new_end_date exists and is unique in self.data_frame["Date"])
-
revise_start_date
(new_start_date: datetime.datetime)[source]¶ Modifies
data_frame such that the starting date of the data is equal to new_start_date (all prior data is deleted).
- Parameters
new_start_date – a datetime object of the new start date for data_frame (where new_start_date exists and is unique in self.data_frame["Date"])
-