Fast Compressed Neural Network For R Evaluation

DRAFT - Feedback is welcome

My findings in this research reinforce what was found here:

I've done a general analysis of the FCNN4R package for R. 

The source for it is located here if you are interested in playing with the code.

The scripts do the following.

  • Download 6 data points per stock that are freely available on the web. The script takes a stock list file as input, with no maximum on the number of stocks.
  • Generate a neural net using the FCNN4R package.
  • Train that neural net using the parameters you pick, or cycle through a set of them if you tweak and run the "Explore Models" script.
  • Output the parameters used, the MSE, and a rough performance evaluation I mocked up quickly (how much money would I lose using this model?).
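
The "cycle through a set of parameters" step can be pictured as a plain grid expansion. This is a minimal Python sketch, not the actual "Explore Models" script: the parameter names and values in `GRID` are assumptions picked for illustration, and a real run would train one net per configuration.

```python
from itertools import product

# Hypothetical parameter grid; names and values are illustrative only.
GRID = {
    "algorithm": ["bp", "sgd", "sa"],
    "hidden_nodes": [10, 40, 120, 200],
    "l2reg": [0.0, 0.2],
}

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

configs = expand_grid(GRID)
print(len(configs))  # 3 * 4 * 2 = 24 configurations
print(configs[0])    # {'algorithm': 'bp', 'hidden_nodes': 10, 'l2reg': 0.0}
```

A full grid like this also produces combinations where a parameter has no effect on the chosen algorithm, which is one way filler values end up in the output.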

Not all models use all of the parameters, so some of the data points are obviously filler.

output data set sample:



Next, I generated 84k+ models and recorded the evaluation and best MSE for each of them. The results are below.

The first analysis I did was just Mean Squared Error vs. a crude performance measure which I haven't spent much time on: basically, how much money this trading model would lose if it were used.

As you can see below, Simulated Annealing had the best general performance in terms of both Mean Squared Error and the little performance function I wrote.

The training set had 1767 observations and 60 variables, and the cross-validation set had 590 observations.

The nets had a 60 - X - 10 structure: 60 input nodes, a variable number of hidden layer nodes (because I wanted to see the impact), and 10 output nodes.
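
To get a feel for how fast a 60 - X - 10 net grows with X, here is a small Python sketch that counts trainable weights. It assumes a fully connected net with one bias per hidden and output node; that is an assumption for illustration, since a compressed net may hold fewer connections.

```python
def mlp_weight_count(n_in, n_hidden, n_out, bias=True):
    """Count weights in a fully connected net with one hidden layer."""
    b = 1 if bias else 0
    return (n_in + b) * n_hidden + (n_hidden + b) * n_out

for hidden in (10, 40, 120, 200):
    print(hidden, mlp_weight_count(60, hidden, 10))
# 10 720
# 40 2850
# 120 8530
# 200 14210
```

For scale, the training set has 1767 records with 10 outputs each, so the larger nets have roughly as many weights as there are target values to fit.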

If you haven't read up on these three neural net training algorithms (backpropagation, stochastic gradient descent, and simulated annealing), I suggest you read about them on Wikipedia. It helps quite a bit.

Below you can see that the Backpropagation nets performed the worst. As they overtrained, they also showed a general tendency to just generate the best 'average' value that any given record should have for each output node, i.e. the net would never avoid a total failure situation, it just hedged across all scenarios. You would never see an output of zero from any of the output nodes.
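
The averaging behavior makes sense from the MSE side: if a net ignores its inputs, the constant output that minimizes MSE is the per-output mean of the targets. The Python sketch below shows this on made-up toy targets (4 records, 2 outputs). It uses the plain mean of squared errors; some packages scale MSE by a constant factor, which does not change the comparison.

```python
def mse(targets, predictions):
    """Mean squared error over all records and all outputs."""
    total = 0.0
    count = 0
    for t_row, p_row in zip(targets, predictions):
        for t, p in zip(t_row, p_row):
            total += (t - p) ** 2
            count += 1
    return total / count

def column_means(rows):
    """Mean of each output column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Toy targets, invented for this example.
targets = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]

means = column_means(targets)                 # [0.5, 0.5]
always_mean = [means] * len(targets)          # hedge every record
always_zero = [[0.0, 0.0]] * len(targets)     # commit to zero everywhere

print(mse(targets, always_mean))  # 0.25
print(mse(targets, always_zero))  # 0.5
```

Hedging with the mean scores better on MSE than committing to zero, even though it never gets a single record exactly right.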

The SGD algorithm did a much better job because of its batching of records during the learning process, which I'll go into later, but it still showed the same averaging behavior.

Simulated Annealing did the best on both MSE and my general performance function.

MSE distribution by neural net algorithm used for tuning
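
For readers who haven't met simulated annealing, here is a minimal Python sketch of the general idea on a toy two-parameter cost function. It is not FCNN4R's implementation: the geometric cooling schedule, the Gaussian step, and every name here are assumptions chosen for illustration.

```python
import math
import random

def simulated_annealing(cost, start, steps=2000, t_start=1.0, t_end=1e-3,
                        step_size=0.5, seed=0):
    """Minimize cost(weights) by random perturbation with a cooling temperature."""
    rng = random.Random(seed)
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = [w + rng.gauss(0.0, step_size) for w in current]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost

def toy_cost(w):
    """Toy bowl-shaped cost with its minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

best, best_cost = simulated_annealing(toy_cost, [0.0, 0.0])
print(best, best_cost)
```

Unlike the gradient-based methods, nothing here needs a derivative: the search only ever evaluates the cost, and the occasional acceptance of worse moves early on is what lets it escape poor regions.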


To demonstrate how the L2Reg parameter impacts the performance of the SGD algorithm, I focused in and color coded the various values I tested. I believe this chart is pretty self-explanatory.

As you lower the value, the MSE goes down, but the generalizability of the function also goes down.

Distribution by L2Reg
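
As a reminder of what an L2 regularization parameter usually does, here is a small Python sketch of the textbook form: the training objective is the data MSE plus the parameter times the sum of squared weights. Whether FCNN4R scales its penalty exactly this way is an assumption I have not checked; the function name and the weights below are made up.

```python
def l2_penalized_loss(data_mse, weights, l2reg):
    """Textbook L2-regularized objective: MSE plus l2reg * sum of squared weights."""
    return data_mse + l2reg * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]   # sum of squares = 5.25
print(l2_penalized_loss(0.10, weights, 0.0))  # no penalty: just the data MSE
print(l2_penalized_loss(0.10, weights, 0.2))  # penalty adds 0.2 * 5.25
```

With a smaller parameter the optimizer is free to grow the weights to chase training MSE, which fits the trade-off above: lower MSE, weaker generalization.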

This is the chart I found most interesting. I filtered out the backpropagation samples and narrowed down to what I felt were some "good" parameters for SGD (i.e. L2Reg of 0.2).

The impact of hidden layer size is pretty obvious from this chart: the bigger your hidden layer, the better the MSE. There is also clearly a steady loss of return on investment as the layer grows. I would love to plot the curve and try to determine the relationship between hidden layer size and the ROI loss as the layer expands.

By Hidden Layer Size

For the Simulated Annealing algorithm there was an additional observation: it trains much faster and shows little to no advantage with larger nets. So I was wondering, how low can I go and still get some value?

It turns out the ratio of hidden layer to input layer needs to be about two to one. Anything over that for SA and you don't get much value, and anything under that and you start seeing almost random results.

I went down to a 60-10-10 net and it was total trash. 60-40-10 still had some models make it onto this graph, and you can see it improves after that, but the points overlap so much that it's hard to distinguish much of a benefit between a 60-120-10 net and a 60-200-10 net.

Simulated Annealing distribution by hidden layer size

I put this last chart together so that I could explore the minibatch size parameter of the SGD training algorithm. If you drop it to 1, you might as well be doing backpropagation.

As you can see from the chart, the smaller the batch, the better the MSE, but the less generalizable the model is.

The larger batch sizes don't reach the lowest MSE, but they perform better in other dimensions that I haven't fully explored yet and won't go into here.
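
One concrete thing the minibatch size changes is how many weight updates happen per pass over the data. The Python sketch below works that out for the 1767-record training set. The helper names are hypothetical, and it slices records in order for simplicity, whereas SGD implementations typically shuffle first.

```python
import math

def minibatches(n_records, batch_size):
    """Yield (start, end) index ranges that cover n_records in order."""
    for start in range(0, n_records, batch_size):
        yield start, min(start + batch_size, n_records)

def updates_per_epoch(n_records, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_records / batch_size)

N_TRAIN = 1767  # training set size from this post
for size in (1, 10, 100, 1767):
    print(size, updates_per_epoch(N_TRAIN, size))
# 1 1767
# 10 177
# 100 18
# 1767 1
```

Small batches mean many noisy updates per epoch, each driven by a handful of records; large batches mean few updates, each averaged over many records.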

Stochastic Gradient Descent mini batch size impact on MSE

If you have questions, comments, or suggestions, please feel free to join the unicorninvesting GitHub or sign up on the site and email me.

The more the merrier.

SA BP and SGD algorithm performance
