Catastrophic Forgetting Is Not an Optimization Problem

Introduction

The strange thing a neural network does

I was doing nothing useful when the thought arrived. Not running an experiment. Not reading a paper. Just sitting with the accumulated weight of six failed experiments in a row, seventeen through twenty-two, each of which had confirmed with increasing precision that my architecture was being beaten by a frozen ViT-B backbone that had learned nothing new at all.

A pretrained model, doing nothing, outperforming everything I built. That should have been discouraging. It felt diagnostic. Zero is information. A frozen model cannot forget, and for six experiments that had been enough to win. The question was not why my architecture was failing. The question was what that failure was made of.

To understand what I mean, consider the simplest version of the problem. You train a neural network to recognize cats. You push accuracy to ninety-four percent. Then you train the same network on dogs. Then you test it on cats again.

The accuracy does not degrade gracefully. It does not slip to eighty percent, then seventy, settling into some reasonable compromise. It collapses. Often to near-chance levels. The network that knew cats no longer knows them. Not partially, not fuzzily. The knowledge is simply gone. This is catastrophic forgetting, and it has been one of the central unsolved problems in machine learning since the late 1980s.^[1,2]

The disorienting fact about it, the thing that took me months to fully absorb, is that nothing was erased. The weights were not wiped. The training data for cats was not deleted. The network's capacity did not shrink. What happened is subtler and in some ways more troubling: the weights moved. They drifted into a new region of the high-dimensional weight space, a region optimized for dogs, and there is no force pulling them back. The cat valley still exists in the landscape. The network is simply no longer in it.

Forgetting is not erasure. It is displacement. The knowledge still exists, encoded in the shape of a loss surface the network can no longer reach.

The simulation below makes this concrete. The 3D surface is the loss landscape, projected down to two weight dimensions for visibility. Blue is the cat task surface. Green is the dog task surface. The glowing orange dot is the network's current weight position. Train on cats, and the dot rolls into the blue valley. Then train on dogs. Watch what happens — not just to the accuracy numbers on the right, but to where the dot ends up. Then rotate the scene and notice that the blue valley is still there, intact, in the landscape. The dot is just somewhere else.

Simulation 1 · dual loss landscape: weight displacement during sequential training (3D weight space, 2-dim projection; labels: cat feature weights, dog feature weights)

Press Train on cats to begin. Watch the orange dot settle into the blue valley as cat accuracy climbs. Then press Train on dogs and observe the displacement.

Notice what you just watched. The dog training did not attack the cat weights. There was no adversarial process. The optimizer followed the gradient for dogs, and the gradient for dogs has no information about the cat surface. The forgetting is a byproduct of indifference, not malice. The loss surface for cats remains geometrically intact after dog training ends. The network is simply somewhere else in weight space.

This distinction between erasure and displacement is not semantic. It is the key to understanding why every major approach to catastrophic forgetting has been attacking the wrong target for thirty years. If forgetting were erasure, you would protect the weights. If forgetting is displacement, you need to understand the geometry of the space well enough to prevent the displacement from happening in the first place.

Background

What continual learning actually requires

To appreciate why catastrophic forgetting matters beyond benchmark performance, consider what any intelligent system deployed in the real world actually has to do. A robot encounters new environments, new objects, new manipulation tasks. A medical AI sees new disease variants, new imaging modalities, new patient populations. A language model needs to incorporate new knowledge without losing command of what it already knows. All of these systems must learn continuously across time, from a non-stationary stream of experience, without access to all past data simultaneously.

This is the continual learning problem. It is harder than it sounds, and not because of computational constraints. It is hard because of a mathematical tension built into how gradient-based learning works: the same parameter update that makes you better at the new task can make you worse at everything else. The tension is not an engineering artefact that better hardware resolves. It is fundamental.^[1,3]

To make this precise: a neural network is a function parameterized by a weight vector w. Training on Task A finds a w that minimizes L_A(w). Training on Task B then minimizes L_B(w). Unless the optima of L_A and L_B are in exactly the same region of weight space, sequential gradient descent on L_B will move w away from the L_A optimum. The distance it moves depends on the gradient magnitude, the similarity of the tasks, and the geometry of the loss surfaces. When that distance is large enough that the network is no longer in the basin of attraction for L_A, forgetting is catastrophic.
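The argument can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the setup of any experiment described here: two quadratic bowls stand in for L_A and L_B, and the names W_A, W_B, loss, grad, and train are hypothetical, chosen only for illustration.

```python
import numpy as np

# Assumed toy optima; real tasks would not have such simple loss surfaces.
W_A = np.array([-1.0, 0.0])   # minimum of the stand-in for L_A
W_B = np.array([1.0, 0.5])    # minimum of the stand-in for L_B

def loss(w, centre):
    """Quadratic bowl: 0.5 * ||w - centre||^2."""
    return 0.5 * np.sum((w - centre) ** 2)

def grad(w, centre):
    """Gradient of the quadratic bowl with respect to w."""
    return w - centre

def train(w, centre, lr=0.1, steps=200):
    """Plain gradient descent on one task's loss only."""
    for _ in range(steps):
        w = w - lr * grad(w, centre)
    return w

w0 = np.zeros(2)
w_after_a = train(w0, W_A)          # settle into Task A's valley
w_after_b = train(w_after_a, W_B)   # then follow only Task B's gradient

loss_a_before = loss(w_after_a, W_A)
loss_a_after = loss(w_after_b, W_A)
displacement = np.linalg.norm(w_after_b - w_after_a)

print("L_A before training on B:", loss_a_before)
print("L_A after training on B: ", loss_a_after)
print("distance moved in weight space:", displacement)
print("L_A at its own minimum, unchanged:", loss(W_A, W_A))
```

In this toy, the function L_A is never modified by training on B. Its minimum is exactly where it was. Only w has moved, and L_A evaluated at the new w is what has grown.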

What makes the problem interesting from a research perspective is that there exists, somewhere in weight space, a region that is simultaneously good for both tasks. It is not empty. The two loss surfaces were constructed from the same world, encoded in overlapping feature spaces. The question is whether gradient descent, operating sequentially on one task at a time, can ever find that joint optimum without explicit access to both tasks simultaneously. The answer, in general, is no. And understanding exactly why that answer is no turns out to require understanding a geometry that most treatments of continual learning never examine directly.

Why replay does not solve the problem

The most intuitive response to catastrophic forgetting is experience replay: store some examples from Task A and include them in the Task B training batch. This works up to a point. It reduces the drift by mixing gradients from both tasks. But it does not change the underlying geometry. The joint optimum is still where it is. Replay helps you approximate joint training, but the approximation degrades as the number of tasks grows and the buffer becomes a smaller fraction of the total training data. It is a practical workaround, not a solution to the geometric problem.^[3]
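A rough sketch of why replay helps without moving the joint optimum, again on assumed quadratic stand-ins for the two losses. The function train_with_replay and the replay_fraction parameter are hypothetical. Gradients are mixed deterministically here, whereas a real buffer would be sampled in minibatches.

```python
import numpy as np

# Assumed toy optima for the two tasks (illustrative only).
W_A = np.array([-1.0, 0.0])
W_B = np.array([1.0, 0.5])

def loss(w, centre):
    """Quadratic bowl: 0.5 * ||w - centre||^2."""
    return 0.5 * np.sum((w - centre) ** 2)

def train_with_replay(w, replay_fraction, lr=0.1, steps=500):
    """Gradient descent on Task B, with a fixed share of each update taken from Task A."""
    for _ in range(steps):
        g_new = w - W_B                      # gradient of the Task B bowl
        g_old = w - W_A                      # gradient of the replayed Task A bowl
        g = (1.0 - replay_fraction) * g_new + replay_fraction * g_old
        w = w - lr * g
    return w

results = {}
for frac in (0.0, 0.1, 0.5):
    w = train_with_replay(W_A.copy(), frac)  # start from Task A's optimum
    results[frac] = (loss(w, W_A), loss(w, W_B))
    print(f"replay fraction {frac}: L_A = {results[frac][0]:.5f}, L_B = {results[frac][1]:.5f}")
```

On this toy, an even mix lands midway between the two minima, which is what joint training on the sum of the two losses would find. As the replayed share shrinks, the end point slides back toward Task B's minimum and the loss on Task A climbs.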

The simulation below gives you a bird's-eye view of what the network is actually navigating. One hundred neurons are shown as a 3D grid. Each one activates at some level for each task. You can see Task A's activation pattern in blue, Task B's in green. The red neurons are the ones both tasks rely on. When you train on Task B, those red neurons get updated to serve Task B. That update necessarily degrades their encoding of Task A. This is where forgetting lives, not in any single weight, but in the shared representation neurons that two tasks cannot both use simultaneously.

Simulation 2 · neuron activation volume: where forgetting lives in the network (3D neuron grid, 5×5×4 = 100 neurons)

Select a mode to explore which neurons activate for each task. The red neurons in overlap mode are the interference zone: they cannot serve both tasks simultaneously with a single weight value.

The interference zone is not the whole story, but it is where most of the damage happens. A neuron that fires strongly for both "cat ear shape" and "dog snout curvature" must encode both features in a single weight vector. When Task B training updates that neuron to better encode dog snouts, it loses precision on cat ears. The loss grows with how strongly the two feature demands compete for the same weights, and with how many neurons are shared.

In a small network, the shared zone is a large fraction of the total. In a large network like ViT-B/14 with 768-dimensional representations, it is a smaller fraction. But the key question is not the absolute size of the shared zone. It is whether the architecture gives the network any way to learn Task B outside the shared zone. And that question turns out to depend on a concept that most continual learning research ignores entirely.

III

The stability-plasticity dilemma

A constraint imposed by geometry, not by engineering

The stability-plasticity dilemma is the oldest framing of the continual learning problem.^[4] A system that is perfectly stable cannot learn anything new: its weights are fixed. A system that is perfectly plastic forgets everything instantly: each new update overwrites the last. Every biological and artificial learning system lives somewhere between these extremes, and the question is how to navigate that space well.

The standard framing presents this as a tradeoff: you sacrifice some plasticity to gain stability, or vice versa. Turn up the learning rate and the system learns fast but forgets fast. Turn it down and it retains old knowledge but struggles to acquire new knowledge quickly. This framing is intuitive and partially correct. But it misses the deeper structure of the problem.

The real issue is not that stability and plasticity pull in opposite directions along a single axis. The real issue is that they are geometrically incompatible under single-pathway gradient descent. To be plastic, the weights must move in response to new data. To be stable, the weights must not move away from old solutions. If the new data pushes the weights along a direction in which the old task's loss rises, a single gradient step cannot satisfy both constraints. No learning rate, no regularizer, and no loss function resolves this. The conflict is structural.
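The learning-rate part of that claim can be checked on a toy. The sketch below assumes two isotropic quadratic bowls as stand-ins for the old and new task losses, starts at the old task's optimum, and takes a single gradient step on the new task at several learning rates. The names one_step and sweep are hypothetical. It says nothing about regularizers or alternative loss functions, which this toy does not model.

```python
import numpy as np

# Assumed toy optima for the old task (A) and the new task (B).
W_A = np.array([-1.0, 0.0])
W_B = np.array([1.0, 0.5])

def loss(w, centre):
    """Quadratic bowl: 0.5 * ||w - centre||^2."""
    return 0.5 * np.sum((w - centre) ** 2)

def one_step(lr):
    """Take one gradient step on Task B from Task A's optimum.

    Returns (gain on B, cost on A): the drop in B's loss and the rise in A's loss.
    """
    w = W_A - lr * (W_A - W_B)
    gain_b = loss(W_A, W_B) - loss(w, W_B)   # plasticity: progress on the new task
    cost_a = loss(w, W_A) - loss(W_A, W_A)   # stability: damage to the old task
    return gain_b, cost_a

sweep = {lr: one_step(lr) for lr in (0.0, 0.01, 0.1, 0.5, 1.0)}
for lr, (gain_b, cost_a) in sweep.items():
    print(f"lr = {lr}: gain on B = {gain_b:.5f}, cost on A = {cost_a:.5f}")
```

In this toy, the only learning rate with zero cost on Task A is the one with zero gain on Task B. Shrinking the step shrinks both numbers together. It never separates them.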

The brain does not resolve this conflict. It sidesteps it entirely, by using two separate learning systems that operate at different timescales and never directly compete for the same weights.^[5] The hippocampus is the fast learner: it encodes new episodes rapidly, one-shot if necessary, in sparse representations. The neocortex is the slow learner: it integrates experience over thousands of repetitions into dense, overlapping representations. Critically, the hippocampus does not write directly to the neocortex in real time. Instead, during sleep, it replays compressed versions of the day's experiences, and the neocortex updates slowly in response to these replays, interleaved with all its prior experience. The result is that the slow system never receives a gradient dominated by a single new task. It always sees a mixture.

The simulation below lets you explore the stability-plasticity surface directly. The vertical axis is effective performance, measured across both old and new tasks simultaneously. The two horizontal axes are plasticity and stability. The surface has a ridge: there is a region of joint parameter space where the system performs well on both dimensions. Outside that ridge, performance collapses. The brain's operating point sits on the ridge. Standard neural network training with sequential gradient descent does not.

Simulation 3 · stability-plasticity surface: the geometry of the dilemma (stability × plasticity → performance surface)

Select a system to see where it sits on the stability-plasticity surface. Notice how standard SGD achieves high plasticity but near-zero stability, while frozen models achieve high stability but zero plasticity. The brain and residual adapters find the ridge.

Several things become clear from this surface. First, the ridge is narrow. The joint optimum requires precise coordination between plasticity and stability, not just a moderate value of each. Second, the standard engineering response, tuning a single learning rate or regularization coefficient, moves you along the surface but cannot get you onto the ridge unless the architecture itself creates the separation. Third, and most importantly, the brain's solution is not on the stability-plasticity tradeoff at all. It is above it, on the ridge, because it uses architectural separation rather than parameter tuning to achieve both simultaneously.

This is the first architectural lesson. The stability-plasticity dilemma is not a constraint you optimize through. It is a constraint you escape from, by building a system that does not have to make the tradeoff in the first place. Complementary learning systems (CLS) theory describes the brain's solution. The question for machine learning is whether we can build an analogous structure.

The complementary learning systems theory

McClelland, McNaughton and O'Reilly (1995) proposed that the brain uses two distinct memory systems at fundamentally different timescales: the hippocampus for rapid encoding of new episodes in sparse, pattern-separated representations, and the neocortex for slow, distributed consolidation of structured knowledge.^[5] The hippocampus can learn a new fact in a single exposure. The neocortex requires thousands of interleaved exposures before it incorporates a new pattern without disrupting old ones. Crucially, McClelland et al. demonstrate through simulation that fast neocortical learning produces catastrophic interference, while slow interleaved learning does not — the timescale separation is not incidental but mechanistically essential.

The mechanism that makes this work is offline replay. During sleep, the hippocampus re-activates compressed representations of recent experiences and replays them to the cortex. The cortex receives these replays interleaved with activations of existing knowledge, ensuring that the gradient it sees is always a mixture, never dominated by any single new task. A subsequent review by O'Reilly et al. (2014) confirmed this picture with two decades of additional neuroscientific evidence, particularly regarding sharp-wave ripple events during hippocampal replay.^[7]
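The contrast between focused and interleaved learning can be sketched with a two-weight linear model. Everything here is a simplification under stated assumptions, not a reproduction of the simulations in McClelland et al.: the two episodes are invented, FastStore is a hypothetical stand-in for the fast learner (a plain list, with no compression or pattern separation), and consolidate replays episodes in strict rotation rather than sampling them.

```python
import numpy as np

# Hypothetical episodes: (input, target) pairs for two tasks that share input feature 0.
EPISODE_A = (np.array([1.0, 1.0]), 1.0)
EPISODE_B = (np.array([1.0, 0.0]), -1.0)

class FastStore:
    """Stand-in for the fast learner: keeps each episode after a single exposure."""
    def __init__(self):
        self.episodes = []

    def encode(self, episode):
        self.episodes.append(episode)

def slow_update(w, episode, lr=0.1):
    """One small delta-rule step of the slow learner on a single episode."""
    x, y = episode
    return w - lr * (w @ x - y) * x

def squared_error(w, episode):
    x, y = episode
    return float((w @ x - y) ** 2)

def consolidate(w, episodes, passes=1000):
    """Offline replay: cycle through the given episodes so their updates interleave."""
    for _ in range(passes):
        for episode in episodes:
            w = slow_update(w, episode)
    return w

store = FastStore()
store.encode(EPISODE_A)
w_old = consolidate(np.zeros(2), store.episodes)     # slow learner already knows Task A

store.encode(EPISODE_B)                               # new episode, stored in one shot
w_focused = consolidate(w_old, [EPISODE_B])           # slow learner sees only the new task
w_interleaved = consolidate(w_old, store.episodes)    # replay mixes old and new episodes

for name, w in (("focused", w_focused), ("interleaved", w_interleaved)):
    print(name, "error on A:", squared_error(w, EPISODE_A),
          "error on B:", squared_error(w, EPISODE_B))
```

In this toy, both schedules end with the new episode learned. Only the interleaved one still answers the old episode correctly, because its updates never come from the new task alone.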

The machine learning field has known about CLS theory for thirty years. It has mostly failed to take the architectural implication seriously: the solution is not a better loss function. It is structural separation of fast and slow learning pathways, with an explicit mechanism for interleaved consolidation.

The sections that follow trace what happens when you take the geometric and architectural lessons seriously. They move from the abstract stability-plasticity surface to the specific geometry of weight space, to the concept of the nullspace, and finally to the phase transition that determines whether forgetting is gradual or catastrophic. Each section is accompanied by a simulation that lets you develop direct intuition for the mathematics involved.

Notes and references — sections I through III

[1] McCloskey, M. and Cohen, N. J. (1989). "Catastrophic interference in connectionist networks: the sequential learning problem." In G. H. Bower (ed.), The Psychology of Learning and Motivation, vol. 24, pp. 109–165. Academic Press. doi:10.1016/S0079-7421(08)60536-8.

[2] Ratcliff, R. (1990). "Connectionist models of recognition memory: constraints imposed by learning and forgetting functions." Psychological Review, 97(2), 285–308. doi:10.1037/0033-295X.97.2.285.

[3] French, R. M. (1999). "Catastrophic forgetting in connectionist networks." Trends in Cognitive Sciences, 3(4), 128–135. doi:10.1016/S1364-6613(99)01294-2.

[4] Grossberg, S. (1980). "How does a brain build a cognitive code?" Psychological Review, 87(1), 1–51. doi:10.1037/0033-295X.87.1.1.

[5] McClelland, J. L., McNaughton, B. L., and O'Reilly, R. C. (1995). "Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory." Psychological Review, 102(3), 419–457. doi:10.1037/0033-295X.102.3.419.

[6] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2017). "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences, 114(13), 3521–3526. doi:10.1073/pnas.1611835114.

[7] O'Reilly, R. C., Bhattacharyya, R., Howard, M. D., and Ketz, N. (2014). "Complementary learning systems." Cognitive Science, 38(6), 1229–1248. doi:10.1111/j.1551-6709.2011.01214.x.

Continuing in this essay · Sections IV through VIII

The geometry underneath the gradient — nullspace, phase transitions, and what experiment 023 actually proved

The next sections move from the stability-plasticity framing to the specific geometry of high-dimensional weight space. They introduce the nullspace concept, explain why the 768-to-128 bottleneck destroyed it, and show how the residual adapter's Pareto improvement is direct empirical evidence for the geometric hypothesis.