Where to begin. I guess I should start with an update about some of the projects I’ve been working on recently… First, the Earthquake Predictor results that can be found at cognoscitive.com/earthquakepredictor are just over a year out of date. I still need to update the dataset to include the past year’s earthquakes from the USGS, but first I’ve been busy using the existing data as a benchmark to test some changes I want to make to the loss function and architecture. I’m still debating whether to continue using an asymmetric loss like Exp Or Log or Smooth L1 Or L2, or to switch to the symmetric Smooth L1, which would reduce false positives substantially. My original reason for an asymmetric loss was to encourage the model to make higher magnitude predictions, but I worry that it makes the model too eager to guess everywhere that earthquakes are frequent, rather than being more discriminating.
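To make the trade-off concrete, here’s a rough sketch of the kind of thing I mean, not the actual Exp Or Log or Smooth L1 Or L2 formulas: an asymmetric Smooth L1 that weights under-predicted magnitudes more heavily (the under_weight coefficient is just a placeholder), next to the plain symmetric version.

    import tensorflow as tf

    def smooth_l1(error, delta=1.0):
        # Quadratic near zero, linear in the tails (a.k.a. Huber loss).
        abs_err = tf.abs(error)
        quadratic = 0.5 * tf.square(error)
        linear = delta * (abs_err - 0.5 * delta)
        return tf.where(abs_err < delta, quadratic, linear)

    def symmetric_smooth_l1(y_true, y_pred):
        return tf.reduce_mean(smooth_l1(y_true - y_pred))

    def asymmetric_smooth_l1(y_true, y_pred, under_weight=2.0):
        # error > 0 means the predicted magnitude was lower than the true one;
        # weight those errors more heavily to encourage bolder predictions.
        error = y_true - y_pred
        weights = tf.where(error > 0.0,
                           under_weight * tf.ones_like(error),
                           tf.ones_like(error))
        return tf.reduce_mean(weights * smooth_l1(error))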
Music-RNN has run into a weird problem: I’m having difficulty reproducing, with the Keras port, the results I got with the old Torch library a few years ago. It’s probably because the Keras version isn’t stateful, but it could also be that some of the changes I made to improve the model have backfired for this task, so I need to do some ablation studies to check. My modification for Vocal Style Transfer is on hold until then.
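For context, statefulness in Keras has to be requested explicitly and managed by hand, which the old Torch code didn’t have to worry about. Roughly (the layer sizes here are placeholders, not the real Music-RNN config):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Stateful LSTM: the batch size is fixed up front, and the hidden state
    # carries over between consecutive batches until it is reset explicitly.
    model = Sequential([
        LSTM(256, stateful=True, batch_input_shape=(1, 1, 128)),
        Dense(128, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam')

    # ...feed each piece one step (or chunk) at a time during training, then:
    model.reset_states()  # reset between independent pieces of music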
In other news, a couple of neat projects I’ve been trying are Lotto-RNN and Stock-RNN.
Lotto-RNN is a silly attempt to predict Lotto Max numbers on the theory that some of them, like the Maxmillions draws, are pseudorandom because they are done by computer rather than a ball machine, and thus might be predictable… Alas, no luck so far; or rather, the results are close to chance. I’m probably not going to spend more time on this long shot…
Stock-RNN is a slightly more serious attempt to predict future daily price deltas of the S&P500 given previous daily price deltas. It uses the same online stateful architecture that seemed to work best for the Earthquake Predictor before. The average result across ten different initializations is about +9% annual yield, which falls below the +10.6% you’d get from just buying and holding the index over the same period. Technically, the best individual model achieved +14.9%, but I don’t know whether that’s just a fluke that will regress to the mean.
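For anyone wondering how I score these, the comparison is conceptually something like the sketch below, on placeholder data; my actual backtest differs in the details and uses the model’s online predictions rather than random numbers:

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder data: a random-walk price series and random "predictions".
    # In the real run these would be S&P500 closes and the model's outputs.
    prices = 100 * np.cumprod(1 + rng.normal(0.0003, 0.01, size=2520))
    deltas = np.diff(prices) / prices[:-1]
    predicted_deltas = rng.normal(0.0, 0.01, size=deltas.shape)

    # Go long on days the model predicts an up move, stay flat otherwise.
    strategy_daily = np.where(predicted_deltas > 0, deltas, 0.0)

    def annualized(daily_returns, trading_days=252):
        total = np.prod(1.0 + daily_returns)
        return total ** (trading_days / len(daily_returns)) - 1

    print('strategy:     %+.1f%%' % (100 * annualized(strategy_daily)))
    print('buy and hold: %+.1f%%' % (100 * annualized(deltas)))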
I also tried a stateless model for Stock-RNN, but it performed much worse. There are some things I can do to adjust this project. For instance, I could modify the task to predict the annual price delta instead, train it on many stocks rather than just the S&P500 index, and use it to pick stocks for a year rather than guessing where to buy or sell daily. Alternatively, I could try to find a news API for headlines and use word vectors to convert them into features for the model.
On the research front, I was also able to confirm that the output activation function I originally named Topcat does seem to work, and it doesn’t require the loss function modifications I’d previously thought were necessary: it works if you use it with binary crossentropy in place of softmax with categorical crossentropy. I still need to confirm the results on more tasks before I can seriously consider publishing the result somewhere. There are actually a few variants, mainly two different formulas plus various modifications that seem to be functional.
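The swap itself is easy to describe even though I’m not sharing the Topcat formulas yet: keep the one-hot targets, replace the softmax output with the custom activation, and train with binary crossentropy instead of categorical crossentropy. A rough sketch, with a sigmoid standing in for Topcat just so it runs, and placeholder sizes:

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Activation

    num_classes, feature_dim = 10, 64  # placeholder sizes

    def topcat(x):
        # Placeholder body: the actual Topcat formulas are withheld for now;
        # sigmoid stands in just so the sketch runs. The point is pairing a
        # non-softmax output with binary crossentropy on one-hot targets.
        return tf.sigmoid(x)

    # Baseline: softmax output with categorical crossentropy.
    baseline = Sequential([
        Dense(num_classes, activation='softmax', input_shape=(feature_dim,)),
    ])
    baseline.compile(loss='categorical_crossentropy', optimizer='adam')

    # Variant: custom output activation with binary crossentropy,
    # trained on the same one-hot targets.
    variant = Sequential([
        Dense(num_classes, input_shape=(feature_dim,)),
        Activation(topcat),
    ])
    variant.compile(loss='binary_crossentropy', optimizer='adam')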
A hidden activation function I was working on, which I named Iris, also seems to work better than tanh. (Edit: More testing is required before I can be confident enough to say that.) As with Topcat, I have several variants that I need to decide between.
Another thing that seems to help is scaling the norm of an RNN’s gradients, rather than just clipping the norm as is standard. Previously, I’d thought that setting the scaling coefficient to the Golden Ratio worked best, but my more recent tests suggest that 1.0 works better. Again, it’s something I need to double-check on more tasks.
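To be clear about the distinction: clipping only shrinks the gradients when their norm exceeds a threshold, whereas scaling always renormalizes them to the target norm. Something along these lines (a sketch, not my exact implementation):

    import tensorflow as tf

    def scale_grad_norm(grads, target_norm=1.0, eps=1e-8):
        # Always rescale the global gradient norm to target_norm. Clipping,
        # by contrast, only shrinks gradients when the norm exceeds the limit.
        global_norm = tf.linalg.global_norm(grads)
        return [g * (target_norm / (global_norm + eps)) for g in grads]

    # Standard clipping, for comparison:
    # clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)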
Some things that turned out not to work reliably better than the control include: LSTM-LITE (my tied-weights variant of the LSTM), my naively and incorrectly implemented version of Temporal Attention for sequence-to-sequence models, and a lot of the places where I used the Golden Ratio to scale things. The formula for Iris does have a relation to Metallic Ratios, but it’s not as simple as scaling tanh by the Golden Ratio, which weirdly works on some small nets but doesn’t scale well. Interestingly, the Golden Ratio is very close to the value suggested for scaling tanh in this thread on Reddit about SELU, so that might be the theoretical justification for it. Otherwise, I was at a loss as to why it seemed to work sometimes.
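For concreteness, the scaled tanh I keep referring to is just this:

    import tensorflow as tf

    PHI = (1.0 + 5.0 ** 0.5) / 2.0  # Golden Ratio, ~1.618

    def scaled_tanh(x):
        # tanh with its output multiplied by the Golden Ratio,
        # used as a drop-in replacement for a plain tanh activation.
        return PHI * tf.tanh(x)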
I’m also preparing to finally upgrade my training pipeline. In the past I’ve used Keras 2.0.8 with the Theano 1.0.4 backend in Python 2.7. This was originally what I learned to use at Maluuba, and it conveniently remained useful at Huawei for reasons related to the Tensorflow environment of the NPU. But it’s way out of date now, so I’m looking at Tensorflow 2.1 and PyTorch 1.4. An important requirement is that the environment needs to be deterministic; Tensorflow 2.1 introduced better determinism support, while the equivalent has been available in PyTorch for several versions now.
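For reference, the determinism setup I’m planning looks roughly like this (the seed values are arbitrary, and the PyTorch lines are the equivalent knobs there):

    import os
    import random
    import numpy as np

    # TF 2.1's switch for deterministic GPU ops; set before TF runs any ops.
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

    import tensorflow as tf

    random.seed(0)
    np.random.seed(0)
    tf.random.set_seed(0)

    # PyTorch 1.4 equivalent:
    # import torch
    # torch.manual_seed(0)
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False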
I’ve used both Tensorflow and PyTorch in the past at work, though most of my custom layers, activations, and optimizers are written in Keras. Tensorflow 2.0+ incorporates Keras, so in theory I should be able to switch to it without rewriting all the customizations, just adjusting the import statements.
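In practice the change should mostly look like this, assuming the tf.keras versions of everything behave the same as standalone Keras (which I still need to verify):

    # Old (standalone Keras 2.0.8 on the Theano backend):
    # from keras.models import Sequential
    # from keras.layers import LSTM, Dense
    # from keras import backend as K

    # New (Keras bundled inside Tensorflow 2.x):
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras import backend as K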
I’ve also switched to Python 3, as Python 2 has apparently reached end-of-life. Mostly, this requires some small changes to my code, like replacing xrange with range, and paying attention to / versus // for integer division.
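The division change is the one most likely to bite silently: in Python 2, / between two ints floors the result, while in Python 3 it returns a float and // is the explicit floor. For example:

    # Python 2: 7 / 2  == 3     (integer division by default)
    # Python 3: 7 / 2  == 3.5   (true division)
    #           7 // 2 == 3     (explicit floor division)
    # xrange(n) in Python 2 simply becomes range(n) in Python 3.
    for i in range(3):
        print(i, i / 2, i // 2)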
One thing I’ve realized is that my research methodology in the past was probably not rigorous enough. It’s regrettable, but the reality is that I wasted a lot of experiments and explorations by not setting the random seeds and ensuring determinism before.
Regardless, I’m glad that at least some of my earlier results have been confirmed, although there are still some mysterious issues. For instance, the faulty Temporal Attention layer shouldn’t work, but in some cases it still improves performance over the baseline, so I need to figure out what it’s actually doing.
In any case, that’s mostly what I’ve been up to lately on the research projects front…