Pink Noise in Neural Nets: A Brief Experiment

Disclaimer: Some basic exposure to machine learning is assumed.

Neural nets are on the rise, now that computing power and parallel data processing capabilities have reached the levels that allow them to shine. Recurrent neural nets, the more sophisticated kind that possess time dynamics, have achieved spectacular results in certain areas. Overfitting, however, has remained a problem to watch out for, one that’s endemic to machine learning in general. Although it cannot be fully solved in principle, a number of techniques have emerged that aim to reduce overfitting and keep it in check. None of those techniques is guaranteed to always work, because the problems neural nets are typically used for are NP-hard. Some are better suited to certain kinds of problems; others, to others.

Supervised machine learning, in general—and neural net models, in particular—ultimately consists in navigating a multidimensional landscape while seeking to minimize the error function. In other words, a neural net is looking for an error minimum on a complex “surface.” One of the dangers is that it may end up stuck at a local minimum that is grossly sub-optimal, missing a better minimum. This is called underfitting, but it’s not as common as overfitting. You see, the landscape itself is shaped by the training data fed into the model. The neural net may then discover a very good minimum (maybe, even the global one), but the said minimum may be peculiar to the training data sample—whereas a less optimal local minimum could prove to be more generic, applicable to more data. Thus, overfitting may result.

Noise Injection and Neural Darwinism

One of the techniques to fight both underfitting and overfitting is noise injection. The intuitive rationale is that, on one hand, adding noise may allow the model to jump over a modestly high error ridge out of a sub-optimal minimum, the kind that it would never have escaped on its own; on the other hand, even accidentally escaping a very good minimum can be beneficial if that allows the model to explore more of the landscape as it changes dynamically with more data fed into it.

Realistically, however, one could hardly expect noise injection to work in simple, more or less determined situations, when the number of minima is relatively small and the landscape doesn’t change dramatically over the course of learning. In cases like that, noise is more likely to be detrimental, only interfering with proper learning. Noise does help, however, in situations more amenable to the butterfly effect (that is to say, when arbitrarily small changes to the initial conditions can result in a significant divergence over time).

This is where the motivation behind noise injection intersects with neuroscience. At any time, whether awake or asleep—even unconscious—the brain is a blizzard of electrical activity, components of which follow the Gaussian distribution and thus qualify as noise. For a long time, this noise was thought to be detrimental to the brain’s “ideal” functioning, an unfortunate artifact of its analog nature. More recently, evidence began emerging that there is more to this noise than meets the eye.

Of course, neural nets themselves were inspired by the brain. Thus, it should come as no surprise that the noise in the brain could be performing the same function as noise injection does in neural nets. But there is more to it. The brain’s electric noise may also serve as the source of “mutations” in the Neural Darwinism paradigm. Advanced by Gerald Edelman, who shared a Nobel Prize for his work on antibodies and who argued that the immune system works by an evolutionary (selective) principle, the Neural Darwinism approach to neuroscience posits that the brain works by the evolutionary principle as well.

Picture the brain as a massively parallel computing system whose main goal is prediction: originally, of the results of active movement (which is why the brain evolved in animals, not plants). As a rather crude simplification, think of many neuronal ensembles trying their hand at predicting the same outcome, sort of like multiple “teams” working on the same problem. If those “teams” are even slightly different, they can produce quite different results, especially if the butterfly effect takes place, as is the case in most real-life situations (which is why our brains are so much better at handling those than even today’s supercomputers, even though we are so much worse at doing simple math). The neuronal ensemble whose prediction matches the outcome more closely (the “winner,” one could say) gets its synaptic connections boosted, while the others tend to lose some strength. The noise is what creates some variation between those “teams.”

That, too, has influenced the design of neural nets. The technique known as “ensemble learning” combines multiple predictive models (not necessarily neural nets, in fact), pitting them against each other while adjusting their weights in the ensemble depending on how well they match the expected outcomes. Thus, it seems appropriate to combine noise injection with ensemble learning, such that the ensemble would consist of models produced by training neural nets of the same structure but with different random noise.
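As a rough illustration of performance-based weighting, here is a minimal sketch in NumPy. It is only one of many possible schemes, not the method of any particular library: the function names, the softmax-over-negative-error rule, and the `temperature` parameter are all assumptions made for the example.

```python
import numpy as np

def ensemble_weights(errors, temperature=1.0):
    """Turn per-model validation errors into weights that sum to 1.

    Lower error gives a higher weight (softmax over negative errors).
    This weighting rule is an illustrative assumption.
    """
    errors = np.asarray(errors, dtype=float)
    # Shift by the minimum error for numerical stability before exponentiating.
    scores = -(errors - errors.min()) / temperature
    weights = np.exp(scores)
    return weights / weights.sum()

def ensemble_predict(predictions, weights):
    """Weighted average of predictions; predictions has shape (n_models, ...)."""
    return np.tensordot(weights, np.asarray(predictions, dtype=float), axes=1)

# Three hypothetical models, each predicting a distribution over two classes.
errors = [0.9, 0.5, 0.7]
predictions = [[0.2, 0.8], [0.6, 0.4], [0.4, 0.6]]
weights = ensemble_weights(errors)
combined = ensemble_predict(predictions, weights)
```

In the scenario discussed here, each of the “models” would be the same network trained with a different noise realization.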

Different Approaches to Noise Injection

Several flavors of noise injection into neural nets can be distinguished. A better-known one is called “dropout”; it consists in randomly dropping some neurons from the hidden layers, different ones on each training pass. Another well-known one is to inject noise into the weights on connections between neurons. But there is yet another one, not widely used: injecting noise into the activation values within neurons.

The dropout method is known to work successfully in a variety of situations. It’s often thought of as a variation of the noise injection into weights, as if the weights on the outgoing connections from the dropped neurons are randomly set to zero. However, this randomness is not applied independently to the entire set of weights in the neural net; indeed, all outgoing connections for the dropped neurons are affected equally. Thus, it is more natural to think of the dropout method as injecting noise into the activation values, if a threshold is applied.
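The two views of dropout can be compared directly. The sketch below, in plain NumPy with made-up layer sizes, applies one keep/drop decision per hidden neuron and checks that masking the activations gives the same result as zeroing all outgoing weights of the dropped neurons. For simplicity it omits the usual rescaling by the keep probability and shares one mask across all examples; both are simplifications for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = rng.random((4, 5))            # activations: 4 examples, 5 hidden neurons
w_out = rng.standard_normal((5, 3))    # outgoing weights to 3 neurons in the next layer
keep_prob = 0.8
mask = rng.random(5) < keep_prob       # one keep/drop decision per hidden neuron

# View 1: noise on activations (dropped neurons output zero).
out_activation_view = (hidden * mask) @ w_out

# View 2: noise on weights (every outgoing weight of a dropped neuron is zeroed,
# i.e. whole rows of w_out at once, not independent entries).
out_weight_view = hidden @ (w_out * mask[:, None])

same = np.allclose(out_activation_view, out_weight_view)
```

Note that the weight view zeroes entire rows together, which is exactly why the per-neuron (activation) view is the more natural description.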

This has a parallel with the brain’s inner workings (as, indeed, should be the case with any successful neural net technique). In the brain, a neuron fires if the total weighted sum of voltage input from the incoming synapses crosses a certain threshold. Random noise, when added to the voltage, can cause a sub-threshold signal to go over the threshold, or vice versa. However, this does not necessarily mean that the dropout method works only on those neural nets whose activation function is of the threshold type (that is, can only take two values: 0 or 1). Indeed, the dropout method works well in situations where a neuron’s activation value can be any number between 0 and 1, although it has been tried primarily (to the best of my knowledge) with non-recurrent neural nets.

But if the dropout method works with neural nets without threshold activation functions, could injecting random noise directly into the activation values, without forcing them to zero, work too? This approach has hardly ever been tried before, so I applied it in one specific scenario to see whether it results in any improvement.

Neural Nets in Language Processing

No, it is not handwriting recognition, an example used in many neural net tutorials. It’s handled well enough by convolutional neural nets (a variant of plain feed-forward neural nets, where no set of neurons forms a loop); and it is no surprise, because it has no time dynamics. We are interested in a recurrent neural net, where the problem of overfitting is particularly strong.

So far, recurrent neural nets have done particularly well in language processing. There are two kinds of models for this sort of task: some try to predict the next word in a sequence; others do it for the next character. Clearly, given the presence of time dynamics (we are, after all, predicting elements in a sequence), recurrent neural nets are a natural fit for this kind of job. At each time step, the next element in the sequence is fed in as input, and so on.

The adherents of word-based models argue that words represent concepts, which is the right level of granularity in speech, whereas characters are an artifact of a system of writing, which—at least, for the alphabetic ones—is more related to pronunciation than the concepts in speech. However, I prefer character-based models, for the following reasons:

The vocabulary to work with is much smaller (just the number of characters, instead of the number of all possible words).
If the model is to be able to write meaningful texts in your language of choice (English, in our case), the least it could do is learn the concept of a word and its implementation via blank space separators—moreover, what punctuation means, including balancing parentheses.
The words are actually learned by the model, rather than handed out in advance. If the words represent concepts, then it’s more interesting to see the model learn them from the given texts.
The model is free to invent new words; in fact, the relative appropriateness of such “neologisms” would speak highly of the model’s quality (and indeed, successful character-based models have come up with rather interesting examples).
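To make the character-level setup concrete, here is a minimal sketch in plain Python of how a text can be turned into a character vocabulary and next-character training pairs. The sample string and variable names are illustrative only; on a tiny sample like this the vocabulary-size advantage is not visible, since it only shows up on a real corpus.

```python
text = "to be, or not to be"

# The vocabulary is just the set of distinct characters (including space and comma).
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode the text as integer ids.
encoded = [char_to_id[ch] for ch in text]

# Next-character prediction: the target at each position is the following character.
inputs, targets = encoded[:-1], encoded[1:]

# Decoding the ids recovers the original text.
decoded = "".join(id_to_char[i] for i in encoded)
```

Nothing here tells the model what a word is; spaces and punctuation are just more symbols to predict.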

For this experiment, I used the character-based model described in the excellent article “The Unreasonable Effectiveness of Recurrent Neural Networks” (where you can also find examples of the language it produced). The original author, Andrej Karpathy, developed it in Lua, but I used its port to TensorFlow. This is actually three models in one, for it can be run in three modes: normal neurons, LSTM cells, and GRU cells. Let’s briefly describe them.

LSTM Cells and Remembered Present

In the first case, each cell in any hidden layer is a normal neuron, whose activation value is computed by taking the weighted sum of the activation values from other neurons over all incoming connections and then applying the activation function.

LSTM (“long short-term memory”) cells have a more complicated structure. For a more detailed description, I will refer the reader to the article “Understanding LSTM Networks.” In a simplified account, think of an LSTM cell as a structure whose memory is controlled by three components:

The “forget gate” decides which information from the cell’s state tensor (accumulated over time) we are going to forget and to what extent (a number between 0 and 1 for each element of the state tensor, with 0 representing complete forgetting and 1 representing full retention).
The “input gate” decides which information in the state tensor will be updated based on the new inputs (again, this is a number between 0 and 1 for each element of the state tensor, with 0 representing no update and 1 representing a full update).
The candidate layer produces the candidate values for the update, which are scaled according to the output of the input gate and added to the state tensor after the forgetting step. (A fourth component, the “output gate,” then decides how much of the updated state is exposed as the cell’s output.)
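For readers who prefer code, a single LSTM time step in the standard formulation can be sketched in NumPy as follows. This is a generic textbook version, not the TensorFlow implementation: the function name, the stacked weight layout, and the shapes are assumptions for the example. In this formulation the candidate values come from a tanh layer, and a separate sigmoid output gate scales what the cell emits.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (standard formulation).

    x: (input_size,); h_prev, c_prev: (hidden_size,)
    W: (4 * hidden_size, input_size + hidden_size); b: (4 * hidden_size,)
    """
    z = W @ np.concatenate([x, h_prev]) + b
    f_raw, i_raw, g_raw, o_raw = np.split(z, 4)
    f = sigmoid(f_raw)        # forget gate: 0 = forget, 1 = retain
    i = sigmoid(i_raw)        # input gate: 0 = no update, 1 = full update
    g = np.tanh(g_raw)        # candidate values for the update
    o = sigmoid(o_raw)        # output gate: how much of the state is exposed
    c = f * c_prev + i * g    # forget first, then add the scaled candidates
    h = o * np.tanh(c)        # the cell's output at this time step
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 2, 3
W = rng.standard_normal((4 * hidden_size, input_size + hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)
h, c = lstm_step(rng.standard_normal(input_size),
                 np.zeros(hidden_size), np.zeros(hidden_size), W, b)
```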

GRU (“gated recurrent unit”) is a variant of LSTM in which the input gate and the forget gate are fused into one, such that the update factors are complementary to the forget factors (their sum is always 1). As such, it is a simplification of LSTM, designed for better computational efficiency, but it may not always work sufficiently well to justify fusing the gates. As always with neural nets, there is no guarantee in any given case, so one must try all sorts of things to see what works.
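The complementary update can be seen in a minimal NumPy sketch of one GRU step. Again, this is a generic textbook formulation with biases omitted, and the names are assumptions; conventions differ between references on whether the update factor or its complement multiplies the old state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU time step (biases omitted for brevity).

    x: (input_size,); h_prev: (hidden_size,)
    W_z, W_r, W_h: (hidden_size, input_size + hidden_size)
    """
    xh = np.concatenate([x, h_prev])
    z = sigmoid(W_z @ xh)     # update factor
    r = sigmoid(W_r @ xh)     # reset gate: how much old state feeds the candidate
    candidate = np.tanh(W_h @ np.concatenate([x, r * h_prev]))
    # Retention (1 - z) and update (z) factors always sum to 1.
    return (1.0 - z) * h_prev + z * candidate

rng = np.random.default_rng(0)
input_size, hidden_size = 2, 3
shape = (hidden_size, input_size + hidden_size)
h_new = gru_step(rng.standard_normal(input_size), np.zeros(hidden_size),
                 rng.standard_normal(shape), rng.standard_normal(shape),
                 rng.standard_normal(shape))
```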

It’s worth noting that LSTM cells have deep parallels with neuroscience, besides the obvious one in their name. I have already mentioned Gerald Edelman in connection with Neural Darwinism. Another of his contributions to neuroscience is the concept of the “remembered present,” which tells us that our brains construct our present model of the world not only (and not even mostly) from the current input but also from our memories. This applies not only to the emotions elicited by the current input in connection with the memories (and therefore, the decisions that we make based on those emotions), but even to the visual image we create in our mind of what we see in front of our eyes: for computational efficiency, we fill in a lot of stuff that looks familiar with what we already know (although that does not apply to what we consciously focus on). This is also why memories are so fluid, because different “mutated” variants (note: Neural Darwinism again) may give a better fit in different circumstances in our present life. The LSTM’s forget-and-update mechanism essentially merges the new input with modified memories.

On Fractal Noise

I didn’t try multiple kinds of noise for this brief project but settled on one of them: pink noise. Sampled over a period of time, its values follow a Gaussian distribution, but pink noise has a specific temporal dynamic. Namely, in the frequency domain, the signal’s power is inversely proportional to frequency. This is a special case of colored (or fractal) noise, whose power spectrum behaves as 1/f^d, where f is frequency and d is the spectral exponent (called here the fractal dimension of the noise). Pink noise has d = 1, which represents the zone of complexity on the spectrum from complete chaos (white noise: d = 0) to the more structured kinds of noise (brown/red noise: d = 2, and higher).
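For illustration, noise with a 1/f^d power spectrum can be generated by shaping Gaussian noise in the frequency domain. The sketch below uses only NumPy; it is a simplified stand-in, not the code of any particular noise package, and the function names are made up for the example. The second helper estimates the exponent back from the signal by fitting a line to the log-log power spectrum.

```python
import numpy as np

def colored_noise(n_samples, exponent=1.0, rng=None):
    """Gaussian noise whose power spectrum falls off as 1/f**exponent."""
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.fft.rfftfreq(n_samples)            # 0 .. 0.5 cycles per sample
    scale = np.zeros_like(freqs)                  # DC component stays at zero
    scale[1:] = freqs[1:] ** (-exponent / 2.0)    # amplitude = sqrt(power)
    spectrum = scale * (rng.standard_normal(freqs.size)
                        + 1j * rng.standard_normal(freqs.size))
    signal = np.fft.irfft(spectrum, n=n_samples)
    return signal / signal.std()                  # normalize to unit variance

def spectral_slope(signal):
    """Slope of log(power) versus log(frequency); about -d for 1/f**d noise."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size)
    keep = slice(1, -1)                           # skip the DC and Nyquist bins
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(power[keep]), 1)
    return slope

pink = colored_noise(2 ** 14, exponent=1.0, rng=np.random.default_rng(0))
white = colored_noise(2 ** 14, exponent=0.0, rng=np.random.default_rng(1))
pink_slope, white_slope = spectral_slope(pink), spectral_slope(white)
```

The estimated slope should come out close to -1 for the pink sample and close to 0 for the white one.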

I could have just used white Gaussian noise, of course, as is common in neural nets. But I figured that, given the time dynamics inherent in recurrent neural nets, if I were to choose only one kind of noise for a brief project, then it had to be pink noise. After all, this is the kind of noise that’s present in the brain. Presumably, this more complex kind of noise should be better at navigating the butterfly effect, although such a statement is hard to quantify and test. Interestingly, the noise present in newborns’ brains is closer to brown noise, whereas the noise in senile adults’ brains is closer to white.

Experiment Setup

As already mentioned, I used a character-based language model, written in Python against the TensorFlow API. TensorFlow is an open-source software library for dataflow programming, developed by Google. It models a program as a directed graph of computation tasks with data flowing between them, optimizes the graph, and schedules the tasks across all available computing resources. This makes it especially attractive for such computation-heavy programs as neural nets.

The TensorFlow Python API already has classes available for representing hidden layers of plain neurons (RNNCell), of LSTM cells (LSTMCell), and of GRU cells (GRUCell). Their constructors accept the activation function as one of their arguments. However, in LSTMCell and GRUCell, the activation function passed in the constructor is only used for the candidate values and the cell output; the gates themselves use a hardcoded sigmoid function instead. So I extended LSTMCell and GRUCell (as NoisyLSTMCell and NoisyGRUCell, respectively) to offer the choice of injecting noise into the activation values for all gates.

I used the colorednoise Python package for generating pink noise. Strictly speaking, the time dynamics in recurrent neural nets arise from back-propagation through time, which means that the hidden layers of neurons (or LSTM/GRU cells) are “unrolled,” each subsequent group of layers representing the previous group of layers at the next time step (by necessity, over a finite number of steps). However, I decided not to assign a separate noise generator to every neuron, since any periodic sample from pink noise should also have the same spectral profile, therefore remaining pink.

Thus, my activation function changed from the default tanh(x) to tanh(x + noise_factor * noise_fn), where noise_fn is the function that generates pink noise and noise_factor is a parameter used to scale the noise’s influence. The sigmoid function was similarly modified.
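In NumPy terms, the modified activation functions amount to something like the sketch below. This is an illustration of the formula only, not the actual NoisyLSTMCell/NoisyGRUCell code: the function signatures are assumptions, and the noise array here is plain white Gaussian noise used as a stand-in, whereas the experiment drew its samples from a pink noise generator.

```python
import numpy as np

def noisy_tanh(x, noise, noise_factor=0.1):
    """tanh with additive noise on the pre-activation value."""
    return np.tanh(x + noise_factor * noise)

def noisy_sigmoid(x, noise, noise_factor=0.1):
    """Sigmoid with additive noise on the pre-activation value."""
    return 1.0 / (1.0 + np.exp(-(x + noise_factor * noise)))

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 5)
# Stand-in noise source: white Gaussian samples, NOT pink noise.
noise = rng.standard_normal(x.shape)

y_tanh = noisy_tanh(x, noise, noise_factor=0.1)
y_sigmoid = noisy_sigmoid(x, noise, noise_factor=0.1)
```

Because the noise is added before the squashing function, the outputs stay within the usual ranges of tanh and sigmoid, and setting noise_factor to 0 recovers the plain activations.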

The hardware used for this brief project was quite modest: a Windows laptop, with Intel® Core™ i7-8750H CPU (2.2-4.2 GHz), 32GB of DDR4 @ 2666 MHz memory, NVIDIA® GeForce® GTX 1070 OC GPU with 8GB GDDR5. TensorFlow was used with CUDA support, to utilize the GPU’s processing capacity.

The full collection of William Shakespeare’s works was used for the data set, which was automatically divided by the Python code into the training, validation, and test data sets. The model was tried with 3 hidden layers of 512 neurons/LSTM/GRU each, as well as with 2 hidden layers of 128 neurons/LSTM/GRU each, with 50 training epochs. The default learning rate of 0.002 produced the best results, although several others were tried, too, to check whether noise injection may allow a faster learning rate to produce comparable results.

The word perplexity on the test set was used to measure how well the model does overall. The difference between the word perplexity on the training and validation sets over the training epochs was used as an indicator of overfitting: more divergence = more overfitting.
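As a reminder of what the metric means, perplexity is the exponential of the average negative log-likelihood that the model assigns to the symbols that actually occur. The sketch below, in plain Python, shows that computation and the simple train/validation gap used here as an overfitting indicator. The function names are made up for the example, and the sample numbers are only illustrative; it says nothing about how the TensorFlow code normalizes its figure.

```python
import math

def perplexity(probs_of_true_symbols):
    """exp of the mean negative log-probability assigned to the true symbols."""
    nll = -sum(math.log(p) for p in probs_of_true_symbols) / len(probs_of_true_symbols)
    return math.exp(nll)

def overfitting_gap(train_perplexity, validation_perplexity):
    """Divergence between validation and training perplexity (larger = more overfitting)."""
    return validation_perplexity - train_perplexity

# A model that always assigns probability 1/4 to the correct symbol is
# "as confused" as a uniform choice among 4 options.
uniform_ppl = perplexity([0.25] * 8)
gap = overfitting_gap(3.0, 3.7)
```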

Summary of Results

In all cases, with and without noise, the LSTM mode was far superior. Surprisingly, the GRU mode was even worse than the plain neuron mode, so that I soon stopped using the GRU mode entirely. The main result, however, is that noise injection didn’t do much. Occasionally, a noisy model would produce a marginally better result (usually, with the noise factor set to 0.1), but more often, a slightly worse one (especially, at higher levels of the noise factor). Overall, it appears that it didn’t matter much whether noise was injected or not.

The divergence trends between the word perplexity on the training and validation sets indicate a mild degree of overfitting. Typically, the best or nearly the best model appeared around epochs 15-20, after which the word perplexity degraded with further training. Another performance peak followed close to epochs 35-40, after which it degraded again, with the word perplexity on the validation set generally going up toward the end.

However, it is interesting to note that the model was quite insensitive not only to noise injection but also to other parameters. The word perplexity on the training set generally was in the neighborhood of 3.0-3.1 and in the neighborhood of 3.7-3.8 on the validation set, in the LSTM mode, pretty much regardless of the learning rate. Even the difference between the 2-layer and 3-layer models was not particularly dramatic (around 3.8/4.3 on training/validation sets for the 2-layer model).

Food for Thought

The observations above suggest one possible interpretation of the results of this experiment: the model may have reached its limits, at least for the limited dataset that it was trained on. It may also be that language processing, being more determined by the nature of language itself, is less amenable to the butterfly effect and therefore is less helped by noise injection. Perhaps more sophisticated neural net models, designed for situations such as high-degree-of-freedom multi-limb articulated movement over a random landscape with obstacles, may be more receptive to noise injection, but I’m not aware of any models of that sort being publicly available (if developed at all) at this time.

It is also possible that another approach to neural net design, of the kind that more closely approximates the way the brain works, may be more successful in general—and with noise injection, in particular. For example, all neurons in the brain have thresholds to fire, instead of scaling their signal; but the model that we tried does not work at all with threshold activation functions.

Another major difference from the brain is that the neurons in the brain don’t all fire at the same time. Indeed, they have different rates of fire, many neurons changing it depending on various factors, thus producing very complex patterns in space and time, the kind that no existing neural net has even approached yet—although there is a promising approach under development, called clockwork neural nets. In these, different groups of neurons fire at different rates; however, all firing remains driven by the same clock, with different multipliers applied to different groups, and the spatial organization of the differently firing groups is still too regular. And there are other differences with the brain (the neuron’s refractory period, just to name one).

Please note that ensemble learning was not tried for this brief experiment. In part, this is because the most interesting technique of ensemble learning, Bayesian model combination, is not currently implemented in TensorFlow, to the best of my knowledge. Bayesian model combination (BMC) differs from Bayesian model averaging (BMA) in that BMC samples from the space of possible ensembles instead of sampling each model in the ensemble individually. This allows it to fix the known vulnerability of BMA, which causes it to converge to a single model in the ensemble, thus defeating the purpose.

As you can see, there is a lot of ground yet to cover and more kinds of models to try, with different neural net designs and kinds of noise. In conclusion, I invite everyone to try their hand at this research.
