What a Neural Network Can Actually Learn
The Question
When people say a neural network is "learning" something, what does that mean geometrically? Not in the abstract — what is the actual shape of the function it's producing?
I'd read the explanations: more layers means more composition, more neurons means more capacity, ReLU activations produce piecewise-linear functions. But I wanted to see it. Watch a network fail to fit a curve because it doesn't have enough neurons. Watch a deep-but-narrow network collapse to a flat line. Compare them side by side, training in real time.
So I built a set of visualisations in Python and PyTorch to do exactly that.
The Target Curve
All the experiments try to fit the same target:
y = sin(x) + 0.5·sin(3x) + 0.3·cos(5x)
This is a superposition of three sine/cosine waves at different frequencies. It's smooth enough to be learnable, but wiggly enough that a weak network can't fake it. It's a standard toy regression target that makes architectural differences immediately visible.
The training data is 200 evenly spaced points from x = −3 to x = 3, with no noise added — the challenge is purely model capacity, not data quality.
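For reference, the target and training grid take only a couple of lines to generate (the helper name `make_target` is my own, not from the original code):

```python
import torch

def make_target(n=200, lo=-3.0, hi=3.0):
    """Noiseless target curve used in all the experiments."""
    x = torch.linspace(lo, hi, n).unsqueeze(1)  # shape (n, 1) for nn.Linear
    y = torch.sin(x) + 0.5 * torch.sin(3 * x) + 0.3 * torch.cos(5 * x)
    return x, y

x, y = make_target()
```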
Experiment 1: Width vs Depth
The first experiment trains four architectures simultaneously on the same data, with a live animation showing each network's prediction curve evolving in real time alongside a shared loss plot:
- 1 neuron, 1 layer — essentially a single bent line
- 8 neurons, 1 layer — shallow but wide
- 1 neuron, 3 layers — deep but narrow
- 8 neurons, 3 layers — deep and wide
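A minimal sketch of how the four contenders might be built, assuming plain `nn.Sequential` MLPs with ReLU hidden layers (the helper name `make_mlp` and the dictionary keys are my own):

```python
import torch
import torch.nn as nn

def make_mlp(width, depth):
    """An MLP with `depth` hidden ReLU layers of `width` neurons each."""
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

nets = {
    "1n x 1L": make_mlp(width=1, depth=1),
    "8n x 1L": make_mlp(width=8, depth=1),
    "1n x 3L": make_mlp(width=1, depth=3),
    "8n x 3L": make_mlp(width=8, depth=3),
}
```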
The results were mostly what I expected — except for one thing.
The deep-but-narrow network (1 neuron × 3 layers) was the worst performer by far — worse even than the single neuron. It barely moved from a flat line across the entire training run. Intuitively this makes sense: each layer is a single scalar transformation, so the whole chain is just a sequence of bent lines stacked on a single value. You can compose functions all you like, but if the signal is a single number at every layer, you can't represent anything complex.
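One way to make that failure concrete: each hidden layer here is a scalar affine map followed by ReLU, and both of those are monotone, so the whole composition is a monotone function of x. A monotone function can never produce the target's up-and-down wiggles, no matter how long it trains. A quick check (the network below is hand-built for illustration):

```python
import torch
import torch.nn as nn

# A deep-but-narrow chain: every hidden layer carries a single scalar.
net = nn.Sequential(
    nn.Linear(1, 1), nn.ReLU(),
    nn.Linear(1, 1), nn.ReLU(),
    nn.Linear(1, 1), nn.ReLU(),
    nn.Linear(1, 1),
)

x = torch.linspace(-3, 3, 200).unsqueeze(1)
with torch.no_grad():
    y = net(x).squeeze()

# Each scalar layer is monotone, and a composition of monotone functions
# is monotone, so the output is either non-decreasing or non-increasing.
diffs = y[1:] - y[:-1]
is_monotone = bool((diffs >= -1e-6).all() or (diffs <= 1e-6).all())
print(is_monotone)  # True, for any weights this architecture can have
```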
The shallow-and-wide network (8 neurons × 1 layer) did surprisingly well — it captured the broad shape of the curve early, even if it couldn't nail the high-frequency wiggles. Width gives you parallel basis functions that sum together. One layer of 8 ReLUs can produce 8 different "bent line" pieces that, weighted correctly, approximate smooth bumps.
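The "bent lines summing to bumps" idea can be shown directly: three ReLU hinges with weights 1, −2, 1 produce a hat-shaped bump, which is exactly the kind of local feature a wide layer can learn to place along the axis (the function name `hat` is mine, for illustration):

```python
import torch

def hat(x):
    # Three ReLU hinges, weighted 1, -2, 1: rises on [0, 1], falls on [1, 2],
    # and is exactly zero everywhere else.
    return torch.relu(x) - 2 * torch.relu(x - 1) + torch.relu(x - 2)

x = torch.tensor([-0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
print(hat(x))  # tensor([0.0000, 0.0000, 0.5000, 1.0000, 0.5000, 0.0000, 0.0000])
```

Shifted and scaled copies of such bumps, summed together, are how one wide ReLU layer approximates a smooth curve.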
The deep-and-wide winner (8 neurons × 3 layers) hadn't converged to a much better fit than the shallow-wide version within the training time — but composing multiple nonlinear layers gives it the capacity to represent more complex relationships given more epochs. The loss curve told the real story: it descended fastest and most consistently.
Experiment 2: Activation Functions
The second experiment lets you define any architecture — layers, neuron counts, and activation function per layer — and watch it learn with a three-panel view: live prediction, loss curve, and a static block diagram of the architecture.
The supported activations are ReLU, Tanh, Sigmoid, LeakyReLU, ELU, GELU, and identity (no activation). GELU is the one used in most modern transformer architectures (BERT, GPT).
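A sketch of how the per-layer activation choice might be wired up; the dictionary keys and helper name are my own, but the `torch.nn` classes are the standard ones:

```python
import torch.nn as nn

# Map user-facing names to torch.nn activation modules (keys assumed).
ACTIVATIONS = {
    "relu": nn.ReLU,
    "tanh": nn.Tanh,
    "sigmoid": nn.Sigmoid,
    "leakyrelu": nn.LeakyReLU,
    "elu": nn.ELU,
    "gelu": nn.GELU,
    "identity": nn.Identity,
}

def get_activation(name):
    """Instantiate an activation module from its name, case-insensitively."""
    return ACTIVATIONS[name.lower()]()
```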
The most striking thing I noticed: switching from ReLU to GELU made convergence noticeably smoother on this task. ReLU networks had a tendency to plateau early and then jump — the loss curve had a jagged, stepped character. GELU curves descended more steadily. This matches what I'd read about GELU's smooth gradient near zero making optimisation more stable.
Sigmoid was the worst. It saturates badly — for neurons whose pre-activations are pushed far positive or far negative, the gradient vanishes, which stalls learning. The loss curve barely moved for the first half of training.
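The saturation is easy to quantify: the sigmoid's derivative is σ(x)(1 − σ(x)), which peaks at 0.25 at x = 0 and collapses towards zero in the tails. A three-point check:

```python
import torch

# Gradient of sigmoid at a saturated, a central, and a saturated input.
x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # roughly [4.5e-05, 0.25, 4.5e-05]: near-zero in the tails
```

Every saturated neuron in a layer contributes one of those near-zero factors to the backward pass, which is why learning stalls.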
Experiment 3: Peering Inside Layer by Layer
The third experiment is the most conceptually interesting. Instead of showing the final output, it renders one panel per hidden layer, each showing the activations of that layer's neurons as a function of x.
The mental model it forces:
output = L4( L3( L2( L1( x ) ) ) )
Each layer is a function. Each panel shows what that function currently looks like across the input range. You watch the network building up a representation step by step — Layer 1 learns something crude, Layer 2 transforms that into something more structured, and so on.
With 1-neuron layers (the deep-but-narrow case), each panel is a single line. You can see exactly why it fails — each transformation is still one-dimensional, so composing four of them doesn't add capacity, it just reshapes a flat curve.
With 4-neuron layers, each panel shows 4 colored lines — one per neuron. You can watch the parallel neurons specialising over training: one might learn to activate on the left side of the domain, another on the right, another near the centre. The final output is a weighted sum of whatever the last layer produces — essentially a learned basis decomposition.
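One way those per-layer curves can be captured is with PyTorch forward hooks; the network and bookkeeping below are illustrative, not the original code:

```python
import torch
import torch.nn as nn

# A small 2-hidden-layer network, 4 neurons per hidden layer.
net = nn.Sequential(
    nn.Linear(1, 4), nn.Tanh(),
    nn.Linear(4, 4), nn.Tanh(),
    nn.Linear(4, 1),
)

activations = {}

def save(name):
    # Forward hooks receive (module, inputs, output); stash the output.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for i, layer in enumerate(net):
    if isinstance(layer, nn.Tanh):
        layer.register_forward_hook(save(f"layer{i}"))

x = torch.linspace(-3, 3, 200).unsqueeze(1)
net(x)
# activations now holds one (200, 4) tensor per hidden layer: one curve
# per neuron, ready to plot against x in its own panel.
```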
What I Actually Learned
The most useful thing wasn't any specific result — it was having a direct visual channel to what the network is doing. Watching the loss curve plateau and then break free, or seeing a layer's activations suddenly reorganise during training, makes the abstract math feel grounded.
Two things I'll carry forward:
- Width before depth — for a fixed parameter budget, shallow-and-wide often outperforms deep-and-narrow on smooth regression tasks. Composition without width is just a chain of scalar reshapings.
- GELU for smooth problems — the smoother gradient profile genuinely helps on toy regression. The gap is small on easy problems, but I'd default to GELU now for anything where I want stable convergence.
Next I want to run the same experiments on a classification boundary instead of a regression curve — to see how decision boundaries evolve during training in the same frame-by-frame way.